Clinical Dashboard Data Warehouse Design Guide
By System C Healthcare PLC | 2011
This article provides information to help support you in developing your own dashboard, or to enhance the functionality of an existing dashboard.
There are a great many potential information sources and local variations in requirements and operational processes. This article is limited to a high level description of the implementation process. The primary recommendation is that a full requirements analysis is performed and this article be used in conjunction with your local IT strategy to determine an appropriate local implementation.
Figure 1 below highlights the processes involved in extracting, translating, loading and collating data to drive a clinical dashboard.
The NHS has a huge variety of data collection systems which often dictate the approach to extract, transform and load data, so an analysis phase (of both data and available skill sets / resources) will be required to determine the most appropriate local solution. Due to these variations in local requirements, strategies and resourcing advice or guidance on how to design or implement an appropriate ETL (Extract, Transform, Load) layer for your site / requirements is beyond the scope of this article.
A number of decisions will influence process and design choices throughout the dashboard design and implementation and a selection of key decisions are listed below. Your local IT strategy, in conjunction with available resources will be a key source of guidance on the approaches to be used.
Data can be loaded into a dashboard in a number of ways – either ‘raw’ granular patient / activity based data which is collated, aggregated and local business logic applied within the dashboard after it has been loaded, or simplified summary data can be loaded.
Data can be taken directly from existing warehouses and repositories in either summary or granular form and used to directly drive metrics on the dashboard or from extracts available from the individual source systems.
Local IT strategy and systems can dictate the design and availability of data feeds in to, or out of any existing systems, as well as their formats and frequencies.
The primary source of advice and guidance on the design and implementation of the database systems, schemas or ETL layer(s) to be used should be your own local IT Strategy document, in conjunction with any local data warehouse / repository documentation. Database design is a complicated and involved process and is heavily driven by the data presented to the system and the requirements for extraction. It will involve a number of compromises between a ‘pure’ design and performance.
Earlier design decisions such as granular activity vs. summary data, or direct feeds vs. text files will heavily influence the required database design and have an impact on infrastructure related choices, such as hardware, operating system, database system etc.
It is recommended that the system is implemented using a star or snowflake schema, on a multi-dimensional or relational database where data is not extracted directly from an existing local data warehouse / repository for presentation.
There should be a series of clearly defined stages during the import or processing of the source data, each performing a clearly identified and separate function where data flows from an initial staging area through to the final presentation layers.
There are a large number of local factors which will heavily influence the design, structure and infrastructure requirements of the database and whether an independent database is even required to support your specific dashboard implementation. Analysis and requirements gathering phases should be undertaken prior to the commencement of any specific implementation work to determine the most appropriate course of action for your local implementation, infrastructure and resource / skill mix.
However, a simple example, based on the ‘Generic Clinical Events’ text feed is described below, which covers some of the design decisions and elements of the flow of data through the various staging areas.
The first design step is to under take a process known as ‘normalisation’ – this is simply identifying repeated or related elements of data and placing these into separate tables. Examples of these are the source / destination organisation codes / names, External Patient Identifier, Parent Episode Identifier and the various date fields.
If we first consider the organisation data, this could be used to populate (using insert / update logic decisions) a reference table containing all of the organisation codes and descriptions. References to both the code and description would be replaced with a single key field, derived from the insert / update logic. Parent Episode and External Patient Identifier fields are in themselves ‘foreign keys’ which could be used to reference either another activity feed or the GCE PMI reference feed. This normalised data is the basis for the database design to hold the data during the late stages of this process.
These reference tables (which are generally built and maintained during the insert / update processes) can be supplemented with pre-built reference tables for handling dates – these could provide pre-calculated derived fields, such as day of week, flags indicating if the date is a weekend flags, financial periods etc.
Data Extracts from the source system – this is beyond the scope of this article, due to the local nature of this process.
Initial Staging Area – This contains the file ‘as is’ and is used as a holding area to determine whether records should be inserted or updated. Entry into this layer should be through a number of ‘data quality’ gateways to ensure that erroneous or missing data is rejected or handled appropriately.
Loaded Raw Data – This contains the file ‘as is’ and holds one row per unique record presented to the system.
Archive Data – All data presented to the system is stored in an archive table.
Normalised Data – This will contain a single table and a number of reference tables. If this were to be viewed diagrammatically, it would resemble a star or snowflake, with a central table and a number of related tables surrounding it. Each related table, may reference a number of additional tables. Data structured in this manner can be used to produce / populate OLAP style cubes.
Aggregate Data Layer – For speed and simplicity, each metric is pre-calculated as a final step in the data load process and this result is placed into a simple table – this allows an amount of abstraction between the raw source data and the front end display and greatly increases the speed of which metrics can be displayed. Performance can be greatly increased by simplifying this structure into a non normalised structure and this would be recommended. An example of a simplified aggregate data layer structure is shown below in Table 1.
Your own local requirements may result in a more complex aggregate data structure, or may conclude that you do not need one at all.
Local implementation and design decisions may require the use of different front end components and display environments. For example, a SSRS report object has a number of grouped metrics / measures, which may not meet the local requirements, due to their positioning or grouping, or Dundas Gauges (or even SharePoint itself) may not be a preferred software solution or comply with the local IT strategy. As an example Business Objects or Cognos tools maybe used to develop dashboards. If these components can simply be lifted onto a local dashboard, there will be a number of localisations that will be required ranging from ensuring RAG ranges are appropriate for the specific organisations needs through to ensuring they can be driven from the locally implemented data sources.
It may be that local skill mix, resource availability or IT strategy drives the presentation of the dashboard components towards using a directly programmed approach using, for example, Java or C#, rather than the more ‘end user’ orientated tools such as SharePoint and Dundas. As stated previously, a full analysis should be performed to determine the most appropriate local solution before progressing down a specific implementation path.
A key element to delivering a successful dashboard implementation is to ensure the data displayed accurately reflects the data presented to the system and any intermediate processing reflects the expected business logic.
The method used for data reconciliation is highly dependent upon the decisions taken around the presentation and processing of source data, however a high level example assuming the use of summary data is shown below.
A simple test that will give you confidence that data is being loaded and processed appropriately is to track the external keys of records through the various stages (archive, stage 1 and stage 2). It should be noted that this external key may not be used as a primary key for the tables in the various stages and the example below assumes this to be the case.
If using the ‘GCE’ feed, each record will contain an external key generated during the production of the feed. This key should be unique to each record and a simple count of occurrences within the stage 1 and stage 2 tables will indicate if the insert / update approach taken is working appropriately. If more than one instance of this key is found in each table, then this may indicate a problem with the insert / update logic used. It should be noted that many bulk insert approaches may not ‘collapse’ multiple instances of a record present in a source file into a single instance due to the manner records are committed.
Simple queries can be written to ensure that all keys present in one staging area exist in all others. If keys are not populating across all staging areas then this suggests an issue with the insert / update logic.
Once you have established the external keys are populating across stages appropriately, the next step may be to check that entire records are populating appropriately.
This process is to simply check that every field in a random sample of records from each data load are consistent across the various staging tables used and reconcile to the source data presented.
Record Set Tracking
Once you have established that entire records can be observed to be populating across the various staging tables, the next step would be to check an entire record set (or data submission).
This would utilize the same approach for checking the individual records, but would be performed on an entire record set, checking the total number of records loaded, as well as each field.
It is key that the above processes are multiple times and with multiple files. A suggested strategy is shown below:
- Empty staging tables, initial test data file. Test as appropriate.
- Repeat load of initial test data file. Repeat tests.
- Load second test data file, which contains a subset of the initial data file. Repeat tests.
Once you have a degree of confidence that the data is being loaded appropriately, the next phase is to reconcile the presentation of the data and any collation, aggregation or business logic applied during the later stages of processing.
There is no set process to do this, as it is highly dependent upon the manner in which data is presented to the system and the processing performed but the approach outlined below may be useful.
Source data is loaded into Excel (for example) and manually manipulated to mimic the collation / processing implemented in the dashboard and this is then compared to the final gauge, report or graph outputs. Ideally this work would be performed by an individual who was not involved with the original development, to test the interpretation of the business logic / requirements as well as their implementation.
Ideally the above steps will be repeated three times, using a small set of known data, a set representative of a live data load and finally with a live data load to give the required level of confidence that the dashboard is working as expected. This step can be particularly useful in highlighting issues with the interpretation of a given metric, as well identifying areas where further analysis may be required to confirm the business logic to be used, or inconsistencies between data sources etc.
The design and implementation of a new dashboard, or extensions to an existing dashboard is a complex undertaking, requiring significant clinical and information support and should be guided by the existing IT strategy documents, as well as any local data warehouse / repository documentation.
A number of key decisions need to be explored before any conclusions can be made on a local level as to the most appropriate design and implementation approach and a number of these are summarized below. Your local analysis phase may well identify a number of additional issues.
- Direct linkage of the front end presentation components vs. ETL processes
- Database creation / revision and schema design
- Availability of creation / modification dates in the source systems
- Availability of robust keys to drive insert / update logic
- Ease of expansion to existing warehouses / repositories
- Resource / skill mix availability
- Granular (patient / episodic) data vs. summary data
- Frequency / availability of data
- Front end presentation approach
- Merging with existing Intranet / SharePoint deployments
- Availability of appropriate reporting components (e.g. Dundas and SSRS)
Was this article useful?13