Prediction Tool Development

By Connecting for Health | 2012

Model Choice

A key decision to make, when either selecting or building a prediction tool, is which prediction model to use. There are a variety of well-known models available, such as PARR, CPM, LACE and PEONY, as well as many variants of these. There is also a growing number of “bespoke” local models generated by model builders for specific local baseline populations.

Irrespective of model type and model builder, the following key criteria should be used when selecting a model:

  • Outcome – what outcomes do you want to predict? Different models predict different outcomes; there is no point in choosing a model that does not predict what you are interested in.
  • Training dataset – what was the scope of the data used to create the model? This includes the baseline population and data sources, and determines the applicability of the model to your requirements. For example, the PARR model used a baseline population of all acute patients in England; it did not include all GP registered patients or any GP data and would therefore not be considered suitable for predictions for primary care populations.
  • Internal validity – what evidence is there that a model is internally robust? Is there a report?
  • External validity – what evidence is there that a model is externally accurate? Is it more or less accurate than similar models?
  • Maintenance – when was the model created? What arrangements are in place to maintain it? Unfortunately, most models do not appear to have maintenance arrangements in place, as many are produced by research groups as a one-off exercise.
  • Representation – how is the model documented/represented? Are there any constraints on access to the representation, or on using it to implement your own prediction tooling? Note that some commercially produced and supported predictive models are treated by vendors as proprietary: although the vendors are happy to share internal and external validity reports, they will not provide a representation of the model.
  • Data input – what input variables (predictors) does the model require to make a prediction? You will need to source and process the correct data to feed into the model. Do you have access to the appropriate data? Is it of good enough quality to produce accurate predictions? How difficult will it be to process the source data into the model inputs?

You may find, based on these criteria, that no existing predictive model fits your requirements adequately. You will then have to decide either to compromise on your requirements or to build your own “bespoke” model.

Marshalling Input Data

Any prediction tooling implementing a predictive model will require input data, as predictor values, to be able to calculate and output an outcome probability. The model description will detail what each predictor represents, its data type and, where the predictor is coded, which code set it uses and which code values must be used. All of this information should be available in the model representation.

In most organisations the required input data will not be readily available in the exact structure and format required by the model. Therefore a process, often called data marshalling, needs to be implemented. This is very similar to the process used to create a training dataset:

  • Identify and document appropriate data sources
  • Address any data sharing issues needed to gain access to source data
  • Check the quality of data and clean if necessary
  • Link data to create single input records for the predictive tool and check linkage quality
  • Transform data in single input records where appropriate to match model predictor data type and code set requirements

As for model building, it is recommended that an RDBMS is used for data storage and for management of the data marshalling process.
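As an illustration, the linkage step can be sketched in an RDBMS. The following is a minimal example using SQLite from Python; the table names, column names and values are hypothetical, and a real implementation would load the staging tables from the actual data sources:

```python
import sqlite3

# Hypothetical staging tables, as if already loaded from data sources.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE baseline (nhs_number TEXT PRIMARY KEY, age INTEGER);
    CREATE TABLE acute_activity (nhs_number TEXT, admissions INTEGER);
    INSERT INTO baseline VALUES ('111', 67), ('222', 54);
    INSERT INTO acute_activity VALUES ('111', 3);
""")

# Link the sources on the person identifier to create one input record per
# individual. A LEFT JOIN keeps individuals with no acute activity, since
# every member of the baseline population needs an input record.
conn.execute("""
    CREATE TABLE tool_input AS
    SELECT b.nhs_number, b.age, COALESCE(a.admissions, 0) AS admissions
    FROM baseline b LEFT JOIN acute_activity a ON a.nhs_number = b.nhs_number
""")
rows = conn.execute("SELECT * FROM tool_input ORDER BY nhs_number").fetchall()
print(rows)  # [('111', 67, 3), ('222', 54, 0)]
```

Checking linkage quality here would mean counting unmatched records on each side of the join and investigating any unexpected gaps.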

Data sources will include an appropriate baseline population source as well as data sources for health and social care records.

An important issue to consider with data sources used to marshal input data is latency. In the simplest case a data source is a direct provider, for example a GP practice, where there is minimal delay, or latency, between information about an individual changing, that information being recorded on the provider’s information system, and it subsequently being available for access as a data source. However, where a data source is an indirect provider, for example a centralised information broker such as SUS, there will be a latency between the changed information being recorded on the direct provider’s information system and it subsequently being available on the indirect provider’s information system for access as a data source. For example, the indirect provider may schedule updates to its information system from the direct provider feeds on a monthly basis.

The situation becomes more complicated with multiple data sources with different latencies. It is often useful to create a scheduling diagram which sets out the data sources and their update schedules and latencies. This will help you decide when it is valid to calculate predictions.
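The scheduling logic can be sketched as a simple calculation: given each source’s last update date and its latency, the most recent date for which all sources hold complete data bounds when a prediction run is valid. The source names, dates and latencies below are entirely illustrative:

```python
from datetime import date, timedelta

# Illustrative sources: last update received, and how far behind the direct
# provider that update runs (its latency).
sources = {
    "gp_extract":  {"last_update": date(2012, 6, 1),  "latency_days": 2},
    "sus_monthly": {"last_update": date(2012, 5, 15), "latency_days": 30},
}

def coverage_date(src):
    """Most recent date the source's data actually reflects."""
    return src["last_update"] - timedelta(days=src["latency_days"])

# Predictions are only valid up to the oldest coverage date across sources.
valid_as_at = min(coverage_date(s) for s in sources.values())
print(valid_as_at)  # 2012-04-15
```

A scheduling diagram is the visual equivalent of this calculation, and also shows when the next update from each source is due.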

Using data sources to create input data for a predictive tool will involve sharing data between organisations within and across health and social care. Such data sharing will require careful consideration of IG obligations, constraints and controls.

The quality of input data will affect the accuracy of any predictions calculated by the prediction tool. You should therefore check the quality of your input data and, where appropriate and feasible, clean it if there are data quality issues. Note, however, that prediction tools normally handle missing values, so you do not normally need to use imputation.

The data taken from different data sources will relate to specific individuals. To construct input records for the prediction tool these data need to be linked together so that all relevant data items used as predictors that belong to a single individual are collated.

Having assembled input records for the prediction tool, these often require a final transformation of individual data items before they are compatible with the expected predictor data type and code set requirements.

Data type transformation is relatively straightforward. For example, a collated data item may be an integer but need to be transformed to a string to be compatible with the expected data type for the predictor.

Code transformation, often called a code crosswalk, can be more problematic. A data item derived from data sources may be coded using one code set, but the predictor requires values from another code set. To transform the code values you need to set up a code crosswalk which maps each individual code value from the source code set to the predictor code set. The crosswalk may map more than one source code value to the same predictor code value, and in some circumstances no valid code mapping can be made.
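A code crosswalk is naturally represented as a lookup table. The sketch below uses illustrative mappings (the codes shown are examples, not a validated mapping); codes with no valid mapping are surfaced explicitly rather than silently dropped:

```python
# Illustrative crosswalk from ICD-10 diagnosis codes to grouped target codes.
# Several source codes may map to one target code.
ICD10_TO_GROUP = {
    "J18.0": "RESP",   # pneumonia codes grouped into one respiratory group
    "J18.1": "RESP",
    "I21.0": "CARD",
}

def crosswalk(code, mapping):
    try:
        return mapping[code]
    except KeyError:
        # No valid mapping can be made; record this in the transformation
        # report rather than guessing a target code.
        return None

print(crosswalk("J18.1", ICD10_TO_GROUP))  # RESP
print(crosswalk("Z99.9", ICD10_TO_GROUP))  # None
```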

The level of granularity as well as the semantic meaning in different code sets can make code crosswalks difficult to create. For example in the health care domain ICD-9 and ICD-10 codes (International Statistical Classification of Diseases and Related Health Problems) are often used by model builders to code mining attributes that represent disease diagnoses and treatments in a training dataset. Some prediction tool users will use HRG (Healthcare Resource Group) codes from centralised data sources, such as SUS. These represent standard groupings of clinically similar treatments which use common levels of healthcare resource. Therefore multiple ICD-9/10 codes will map to a single HRG code and conversely a single HRG code will map to multiple ICD-9/10 codes.

It is recommended that the data transformation rules and code crosswalks are formally documented.

Data transformation can be implemented in a variety of ways; however, the most common approach is to build on the staging tables within a database. The data transformation logic is implemented in bespoke code, which traverses the staging database table containing the assembled prediction tool data and, for each record it finds, transforms the appropriate data items. The transformed set of records is written out to a separate database table that will hold the prediction tool input data, and the detailed results of the transformation are written to some form of data transformation report. The code can be implemented either within the database itself, for example as stored procedures, or in an external application, for example a Java or .NET application.
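Putting the pieces together, the staging-table approach described above might look like the following sketch, with hypothetical table names, an illustrative crosswalk and SQLite standing in for the production RDBMS:

```python
import sqlite3

# Hypothetical staging table and empty input table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE staging (nhs_number TEXT, age INTEGER, diagnosis TEXT);
    CREATE TABLE tool_input (nhs_number TEXT, age TEXT, diagnosis_group TEXT);
    INSERT INTO staging VALUES ('111', 67, 'J18.0'), ('222', 54, 'Z99.9');
""")

ICD10_TO_GROUP = {"J18.0": "RESP"}   # illustrative crosswalk, not validated
report = []                          # feeds the data transformation report

for nhs_number, age, diagnosis in conn.execute("SELECT * FROM staging"):
    group = ICD10_TO_GROUP.get(diagnosis)
    if group is None:
        report.append(f"{nhs_number}: no mapping for {diagnosis}")
        continue
    # Data type transformation: this model expects age as a string.
    conn.execute("INSERT INTO tool_input VALUES (?, ?, ?)",
                 (nhs_number, str(age), group))

print(conn.execute("SELECT * FROM tool_input").fetchall())
print(report)  # ['222: no mapping for Z99.9']
```

The same logic could equally be written as stored procedures inside the database, as the text notes.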

Figure 1 - Data Marshalling.


Data marshalling is likely to be a process that is both time and resource constrained, especially if the prediction process needs to be event driven. Data storage will therefore need higher levels of availability, reliability and performance than are required for model building. You should document the non-functional requirements for data storage and management in detail. These should include:

  • Performance – the speed (response time) or effectiveness (throughput) of a computer, network, software program or device.
  • Reliability – the ability of a system or component to perform according to its specifications for a designated period of time.
  • Availability – the ratio of time a system or component is functional to the total time it is required or expected to function.
  • Scalability – the ability of a system to continue to function well when it is changed in size or volume in order to meet a user need.

For data marshalling, server computers are probably more appropriate than desktop computers. Consideration should be given to security, backup/recovery and RDBMS features.

Tool Integration

A prediction tool will support some level of integration with its environment and interoperability with other systems and services.

At a minimum a prediction tool must support:

  • A method of inputting predictor values for one or more individuals. This input should include at least one data item that is not a predictor but is used to identify the individual (a supplementary attribute).
  • A method of outputting the outcome probabilities for one or more individuals. Associated with each outcome probability should be the identifier value input to identify the individual.

The most rudimentary input and output methods are screens for typing in data and viewing results. Although these are useful for learning about a tool and for ad-hoc testing, a prediction tool must also support some form of data input and output to external storage.

At a minimum this should be from and to text-based files. Where an RDBMS approach has been taken for data marshalling, as recommended, the RDBMS will need to export the records in the Prediction Tool Input Data table to a text file.
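Exporting the input data table to a text file is straightforward in most environments. A minimal sketch, assuming a hypothetical tool_input table in SQLite and a CSV output format:

```python
import csv
import sqlite3

# Hypothetical input data table, as produced by data marshalling.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tool_input (nhs_number TEXT, age INTEGER, admissions INTEGER);
    INSERT INTO tool_input VALUES ('111', 67, 3), ('222', 54, 0);
""")

# Export every record to a text file the prediction tool can load.
with open("tool_input.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["nhs_number", "age", "admissions"])   # header row
    writer.writerows(conn.execute("SELECT * FROM tool_input"))
```

The exact column order and delimiter must match whatever the prediction tool’s file interface expects.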

A prediction tool may support a direct interface to an RDBMS, allowing input data to be loaded directly from a database table and probability outcomes to be stored directly to an outcome table.

A prediction tool may support an API that allows external systems to input data and retrieve probability outcomes programmatically. Such an API is desirable where the prediction tool is going to be programmatically integrated with other systems, especially in a real-time environment where predictions for individuals are made dynamically, for example as part of a discharge summary from a hospital. There are no standardised prediction tool APIs; it is therefore recommended that, from a system integration viewpoint, you develop your own façade API that your systems use to interface with the prediction tool. This façade API will in turn directly use the prediction tool API. By using a façade API you decouple your systems from the specific prediction tool you are using, which makes it easier to change the prediction tool – you only need to recode the façade API.
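The façade pattern described above can be sketched as follows; the vendor client class and its method are stand-ins for whatever API the chosen tool actually exposes:

```python
class VendorToolClient:
    """Stand-in for a vendor-specific prediction tool API (hypothetical)."""
    def run_model(self, payload):
        return 0.42  # dummy probability for illustration

class PredictionFacade:
    """Stable interface your own systems code against."""
    def __init__(self, client):
        self._client = client

    def predict(self, person_id, predictors):
        # Adapt your canonical input shape to the vendor's. Only this
        # adapter changes if the prediction tool is swapped.
        probability = self._client.run_model(predictors)
        return person_id, probability

facade = PredictionFacade(VendorToolClient())
print(facade.predict("111", {"age": 67}))  # ('111', 0.42)
```

Swapping prediction tools then means writing a new client class and passing it to the façade; callers are untouched.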

A prediction tool may support Web Services that also allow external systems to input data and retrieve probability outcomes programmatically. Web Services are similar to APIs but are more suited to Service Orientated Architectures (SOA), where individual systems are decoupled into independent services that interface using standard internet protocols, allowing physically distributed services running on disparate platforms and technologies to be easily integrated. There is no standardised prediction tool web service; therefore, as for the API, it is recommended that from a system integration viewpoint you develop your own façade web service that your services use to interface with the prediction tool.

Figure 2 - Prediction Tool integration.


When interfacing to a prediction tool, consideration should be given to the model of interaction. Do you need to push or pull data to the tool? Do you need to interact synchronously or asynchronously? APIs and web services can support push and pull, and both synchronous and asynchronous interaction. File input is normally suited to pull, where the prediction tool is instructed manually or via a schedule to load a text file. File output is normally suited to push, where the prediction tool will automatically or manually save a text file.

Tool Function and Use

When either selecting or building a prediction tool the functionality and style of use of the tool needs to be considered.

A suggested minimum set of functionality for a prediction tool is:

  • Creating outcome predictions based on a predictive model and input data – sometimes referred to as the prediction engine; this is the core functionality of a prediction tool.
  • Interface(s) to accept input data.
  • Interface(s) to produce output data.
  • A user interface (UI) for both administrators and end users.
  • Access control – to control users and external system access to the prediction tool.
  • Error and audit logging.
  • Error and audit reporting.
  • Configuration management of engine, interfaces, access control and logging.
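The first item in the list, the prediction engine, is typically a scoring function applied to the model representation. As an illustration, the sketch below scores a logistic regression model; the coefficients are made up, and a real engine would read them from the model representation rather than hard-coding them:

```python
import math

# Illustrative model representation: intercept and per-predictor coefficients.
MODEL = {"intercept": -3.0, "coefficients": {"age": 0.03, "admissions": 0.8}}

def predict_probability(model, predictors):
    # Linear score from intercept plus coefficient * value for each predictor.
    score = model["intercept"] + sum(
        model["coefficients"][name] * value
        for name, value in predictors.items()
    )
    # Logistic link maps the score onto a probability in (0, 1).
    return 1.0 / (1.0 + math.exp(-score))

p = predict_probability(MODEL, {"age": 70, "admissions": 2})
print(round(p, 3))  # 0.668
```

Other model families (decision trees, survival models and so on) have different scoring functions, but the engine plays the same role: inputs in, outcome probability out.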

Consideration should be given to how configurable and adaptable a prediction tool is. Most tools will have the implementation of the predictive model “baked” into the product; any changes to the model definition will then often require code changes to the product. Use of the standard predictive model representation PMML offers the opportunity to implement highly adaptable prediction tools. The configurability of the input and output interfaces to the tool is also important: interfaces that can easily be configured to input and output data in different structures will make system integration easier and cheaper.

In addition to the minimum set of functionality a prediction tool may implement additional functionality around:

  • Management – storage of outcomes so they are amenable to further analysis and display. Outcomes will be associated with an individual and there will normally be historical prediction outcomes for an individual that need to be retained to allow trend analysis.
  • Analysis – grouping (stratification), calculating descriptive statistics, statistical inference and trend analysis of outcomes.
  • Display – graphical and text display of analysis results of outcomes and the raw outcomes both at individual and stratum levels. This may include integration with other health or social care data relating to an individual or stratum. This may include display in the context of a dashboard.
  • Distribution – provision of outcomes and analysis of outcomes to other systems or services.

This type of additional functionality is normally classified as Business Intelligence (BI) or Analytics, and there are many products available (some open source) that provide it; however, some prediction tools include this additional functionality as bespoke code. This can limit their future extensibility to handle new requirements compared with the use of a dedicated BI platform. When selecting or building a prediction tool it is therefore recommended that you consider using a BI platform to provide the analytical functions – especially if you are currently using a BI platform within your organisation to support other applications.

This additional functionality is used to implement downstream business functionality within health care such as:

  • Risk Stratification – stratification of baseline population patients into groups with similar predicted risk to help understand the case mix at both commissioner and primary care provider levels.
  • Case Finding – identifying individual patients within a baseline population who have a predicted high risk to their primary care provider so they can initiate appropriate interventions.
  • Resource Management – builds on risk stratification to understand resource implications of predicted case mix at both commissioner and primary care provider levels.

Some prediction tools do not distinguish between the prediction functionality and the downstream business functionality; for example, providing a single tool for risk stratification that takes data from data sources and presents appropriate analyses and reports. Such all-in-one products may initially seem attractive in terms of reducing operational complexity and maintenance; however, they often lack the flexibility to add new business functionality easily in the future or to change technology and application components.

Therefore from an architectural viewpoint it is recommended that the prediction tool is considered to be a separately implemented function that then integrates with separate downstream business functions (such as risk stratification).

This decoupling of the prediction functionality from the business application of the predictions will provide the maximum flexibility. In practice, storage and management of the predicted outcomes is a necessary precursor for all of the other business functions. It could therefore logically be regarded as a functional component of a prediction tool; however, again to provide the maximum flexibility, it is recommended that this is decoupled and provisioned as a separate service.

These considerations lead to the recommended target architecture for an end-to-end predictive modelling solution shown below in Figure 3.

Figure 3 - Target solution architecture.


You need to consider how you want to use a prediction tool and match this to the capabilities of actual tools. The main determinant of use is when you need to make predictions. Making predictions can be time driven or event driven. Time-driven prediction means you decide when to make predictions. For example, in risk stratification it is common to build, on a regular schedule (say every month), an input dataset with current predictor values for all individuals in the baseline population and then submit this dataset to a prediction tool. Event-driven prediction means external events decide when you need to make a prediction. For example, in an acute setting you may want an updated risk prediction for an individual as part of their discharge process; this will be triggered by the discharge event.

In terms of data processing, tools can be used in batch and/or transactional modes. The majority of health care uses of prediction tools currently operate in batch mode. In batch mode you require the tool to handle the input and output of large datasets, but speed of processing is usually not critical. In transactional mode you require the tool to handle many small datasets or individual predictions; speed of processing is now important, as you will be concerned about the throughput capability of the tool. Batch mode will be important when you are using time-driven prediction; transactional mode will be important when you are using event-driven prediction.

It is recommended that you assess the batch and transactional capabilities of any prediction tool against your time and/or event driven usage requirements.

Tool Implementation


Implementation of a prediction tool will involve:

  • Development/selection of a tool
  • Integration testing of the tool
  • Hosting of the tool
  • Operational management of the tool

Standard system development and implementation guidance applies.

Development


When designing and developing a tool you or the vendor should manage and control the system development life cycle through a well-defined set of policies and procedures. This should include unit testing, functional testing and acceptance testing (including the creation of “test rig” clients and end-points).

Acceptance testing should test both the functional and non-functional aspects of the tool. For functional testing, at an absolute minimum, a set of test cases should be prepared that present a set of predictor values and the expected outcome probability. These test cases should be prepared independently of either the in-house team developing the tool or the external vendor supplying it. Note that preparing these test cases requires a representation of the predictive model used by the prediction tool, so that you can independently calculate the outcome prediction (often, unfortunately, by hand!). This implies that any prediction tool provided as a “black box”, with no detailed definition of the predictive model used, cannot be properly acceptance tested.
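Independently prepared test cases can then be run against the tool mechanically. The sketch below assumes a logistic regression model whose representation (intercept and coefficients) is available; the coefficients and expected probabilities are illustrative, with the expected values calculated by hand from the model representation rather than taken from the tool under test:

```python
import math

# Each case pairs predictor values with an expected probability that was
# calculated independently from the model representation.
TEST_CASES = [
    ({"age": 70, "admissions": 2}, 0.668),
    ({"age": 50, "admissions": 0}, 0.182),
]

def tool_under_test(predictors):
    # Stand-in for the real prediction tool; assumed to implement the same
    # illustrative logistic model the test cases were prepared from.
    score = -3.0 + 0.03 * predictors["age"] + 0.8 * predictors["admissions"]
    return 1.0 / (1.0 + math.exp(-score))

for predictors, expected in TEST_CASES:
    assert abs(tool_under_test(predictors) - expected) < 0.001, predictors
print("all acceptance test cases passed")
```

The tolerance (here 0.001) should reflect the precision to which the expected values were calculated.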

Many developers and vendors struggle with non-functional testing, as it often involves using specialised performance measurement tools and rigs, and instrumenting the application to produce the appropriate information to measure. All too often non-functional testing is relegated to a few rudimentary tests or is omitted completely. It is recommended that at least a few basic non-functional tests around capacity, performance, reliability and availability are carried out. Most of these can be simple stress tests where the system is increasingly loaded until it breaks.

Choice of programming languages and frameworks for bespoke software development should be restricted to current mainstream technologies such as Java and .NET.

Integration Testing

Following development and functional testing of a tool, a series of integration tests will be required. These tests will validate the communication of information to the tool from data sources via the data marshalling process, and from the tool to data consumers.

An issue to consider here is what data to use. A small set of test cases will be adequate for basic testing; however, you usually require a larger volume of data to test the correct functioning of interfaces. This data may be “live data” representing actual individuals. If there are no security or IG risks in using live data for integration testing (for example, because the data is anonymised or pseudonymised), and there is no possibility that the results of processing this data are used for any health or social care purpose during integration testing, then this is acceptable. However, even with these risks mitigated, you may find that you are still not allowed to use live data for integration testing. You will then need to create large volumes of test data. This is best done by developing bespoke test data generators rather than attempting to craft test data by hand.
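A bespoke test data generator can be very simple. The sketch below writes a file of synthetic input records; the field names, value ranges and record count are illustrative, and the identifiers are deliberately non-real:

```python
import csv
import random

random.seed(42)  # fixed seed so test runs are reproducible

def generate_records(n):
    """Yield n plausible but entirely synthetic input records."""
    for i in range(n):
        yield {
            "nhs_number": f"TEST{i:07d}",   # clearly non-real identifier
            "age": random.randint(18, 95),
            # Skewed choice list: most individuals have few or no admissions.
            "admissions": random.choice([0, 0, 0, 1, 1, 2, 3]),
        }

with open("integration_test_input.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["nhs_number", "age", "admissions"])
    writer.writeheader()
    writer.writerows(generate_records(10_000))
```

Tuning the value distributions towards those seen in real data makes the generated volumes a better test of interface behaviour.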

Integration testing should also provide some form of non-functional testing, although the limitations and difficulties of such testing, as outlined in the previous section, should be taken into account.

Hosting


There are many options for hosting a tool. These include:

  • In-house – within your own data centre or server room
  • External public sector co-located – within an external data centre owned and run by a public sector organisation but on infrastructure dedicated to your tool
  • External public sector shared – within an external data centre owned and run by a public sector organisation but as a shared service that other organisations will use
  • External private sector co-located – within an external data centre owned and run by a private sector organisation but on infrastructure dedicated to your tool
  • External private sector shared – within an external data centre owned and run by a private sector organisation but as a shared service that other organisations will use

The choice of which hosting option to use will be driven by many factors such as:

  • Cost
  • Security and IG constraints
  • Service levels
  • Network connectivity
  • Reliability
  • Availability
  • Performance
  • Disaster recovery
  • Business continuity

These factors will be specific to each organisation implementing a tool.

Operational Management

Operational management covers the planning, development, delivery and day-to-day operational procedures relating to the tool. This includes appropriate levels of service governance, project management, resource management and analytical reporting, as well as compliance to relevant standards and guidelines.

ITIL provides the recommended best practice approach to IT service management which includes operational management.