=== modified file 'src/docbkx/en/dhis2_implementation_guide_data_warehouse.xml' --- src/docbkx/en/dhis2_implementation_guide_data_warehouse.xml 2012-02-08 22:35:01 +0000 +++ src/docbkx/en/dhis2_implementation_guide_data_warehouse.xml 2012-03-13 12:55:12 +0000 @@ -7,7 +7,7 @@ Data warehouses and operational systems A data warehouse is commonly understood as a database used for analysis. Typically data is uploaded from various operational / transactional systems. Before data is loaded into the data warehouse it usually goes through various stages where it is cleaned for anomalies and redundancy and transformed to conform with the overall structure of the integrated database. Data is then made available for use by analysis, also known under terms such as data mining and online analytical processing. The data warehouse design is optimized for speed of data retrieval and analysis. To improve performance the data storage is often redundant in the sense that the data is stored both in its most granular form and in an aggregated (summarized) form. A transactional system (or operational system from a data warehouse perspective) is a system that collects, stores and modifies low level data. This system is typically used on a day-to-day basis for data entry and validation. The design is optimized for fast insert and update performance. - + There are several benefits of maintaining a data warehouse, some of them being: @@ -46,16 +46,15 @@
- Aggregation strategies in DHIS 2 - DHIS 2 is designed to run in low-end environments which puts certain restrictions on the performance. Two strategies for aggregation of data is offered: Real-time aggregation means that the system will generate aggregated data on-the-fly based on the low-level data every time a report is requested. This implies that the aggregate data will reflect the the very latest captured data and is useful if producing reports immediately after data entry has been done is a priority. The downside is that this will not perform adequately on an online server where the database contains a large number of records and there is high user concurrency. - Batch aggregation means that the system will generate aggregated data every night for a defined time-span (typically the last two years) based on the low-level data and write this data to a data mart. A data mart is a data store optimized for meeting the most common user requests for data analysis. The DHIS 2 data mart contains data aggregated in the space dimension (the organisation unit hierarchy), time dimension (over multiple periods) and for indicator formulas (mathematical expressions including data elements). This strategy for aggregation provides great performance even in high-concurrency environments since most requests for analysis can be served with a single, simple database query against the data mart. The aggregation engine in DHIS 2 is capable of processing low-level data in the multi-millions and manage most national-level databases, and it can be said to provide near real-time access to aggregate data. The downside of this approach is that captured data will not be available for aggregated analysis until the next day. However, for a routine system like DHIS 2 where data is typically collected with a monthly periodicity this is not a significant problem. - Hint: The aggregation strategy can be set in “Settings” - “System settings”, while scheduling of data mart exports can be enabled in “Reporting” - “Scheduling”. + Aggregation strategy in DHIS 2 + The analysis tools in DHIS 2 reads aggregated data from data mart tables. A data mart is a data store optimized for meeting the most common user requests for data analysis. The DHIS 2 data mart contains data aggregated in the space dimension (the organisation unit hierarchy), time dimension (over multiple periods) and for indicator formulas (mathematical expressions including data elements). Retrieving data directly from data marts provides good performance even in high-concurrency environments since most requests for analysis can be served with a single, simple database query against the data mart. The aggregation engine in DHIS 2 is capable of processing low-level data in the multi-millions and manage most national-level databases, and it can be said to provide near real-time access to aggregate data. + DHIS 2 allows for setting up scheduled aggregation tasks which typically will refresh and populate the data mart with aggregated data every night. You can choose between aggregating data for the last 12 months every night, or aggregate data for the last 6 months everry night and the last 6 to 12 months every Saturday. The scheduled tasks can be configured under "Scheduling" in "Data administration" module. It is also possible to execute arbitrary data mart tasks under "Data mart" in "Reports" module.
Data storage approach There are two leading approaches for storing data in a data warehouse, namely the normalized and dimensional approach. DHIS 2 lends a bit from the former but mostly from the latter. In the dimensional approach the data is partitioned into dimensions and facts. Facts generally refers to transactional numeric data while dimensions are the reference data that gives context and meaning to the data. The strict rules of this approach makes it easy for users to understand the data warehouse structure and provides for good performance since few tables must be combined to produce meaningful analysis, while it on the other hand might make the system less flexible and harder to change. In DHIS the facts corresponds to the data value object in the data model. The data value captures data as numbers, yes/no or text. The compulsory dimensions which give meaning to the facts are the data element, organisation unit hierarchy and period dimensions. These dimensions are referred to as compulsory since they must be provided for all stored data records. DHIS 2 also has a custom dimensional model which makes it possible to represent any kind of dimensionality. This model must be defined prior to data capture. DHIS 2 also has a flexible model of groups and group sets which makes it possible to add custom dimensionality to the compulsory dimensions after data capture has taken place. You can read more about dimensionality in DHIS 2 in the chapter by the same name. - +