
Predictive Model Building and Selection

By Connecting for Health | 2012

Data Sources

Any predictive model must be built from a sufficiently large corpus of data. In data mining terms this data consists of examples from which a data mining analysis can extract patterns – a model. An example represents the entity that is of interest, the entity of which you wish to ask questions. In health and social care the example normally represents an individual. Associated with an example is a collection of attributes, where an attribute measures a specific aspect of an example. In health care, for example, the age of an individual and their gender are attributes commonly used in most models. The collection of examples is normally called the training dataset, as in data mining terms the data mining analysis (also sometimes referred to as machine learning) uses the dataset for training the model.

Choosing which attributes to include in an example and which examples to collect into a training dataset are outside of the scope of this article. However in practical terms such decisions will be constrained by the available data sources.

The attributes selected must include the attribute(s) that represent(s) the desired outcome(s), for example emergency admission in the next 12 months. Other attributes are mining fields that the data mining algorithm will process, selecting a subset for inclusion in the model as predictor attributes. Additional supplementary attributes will also normally be included; these are used to identify or annotate individual examples but are not used by the data mining algorithm.
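As an illustrative sketch, a single training example might then look as follows; the attribute names and values are hypothetical and simply show the three roles an attribute can play:

```python
# One hypothetical training example (one row of the training dataset)
example = {
    # Supplementary attribute: identifies the example, not used by the algorithm
    "nhs_number": "9999999999",          # placeholder value
    # Mining fields: candidate predictor attributes
    "age": 67,
    "gender": "F",
    "emergency_admissions_last_12m": 2,
    # Outcome attribute: what the model is being trained to predict
    "emergency_admission_next_12m": True,
}
```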

In health and social care the scope of the training dataset is normally set by geographical boundaries or commissioner responsibility. This usually defines a baseline population of individuals for which appropriate attributes need to be sourced.

For geographical boundaries, census or electoral roll data sources could be considered, as well as the DH Personal Demographics Service (PDS). For any of these sources some individuals may be missing, for example because they have not registered on their local electoral roll. The Electoral Commission estimated, based on Census records, that the completeness of the electoral registers was 91–92% in 2000. However there was a disparity of completeness across different geographical areas and social groups. For example, in their study areas they found under-registration to be notably higher than average among 17–24 year olds (56% not registered) and black and minority ethnic British residents (31%). For health and social care, under-representation of these social groups in the training dataset could well lead to significant biases in the predictive model if causes of morbidity and mortality are related to social, environmental, economic or ethnic factors.

For more information on the Census see:

http://www.ons.gov.uk/ons/guide-method/census/2011/index.html

For more information on Electoral Rolls see:

http://www.electoralcommission.org.uk/elections/voter-registration

For commissioner responsibilities existing provider registration data sources could be considered. For health commissioners such as Primary Care Trusts (PCT) and the new Clinical Commissioning Groups (CCG), patients registered with their constituent GP Practices can be used. This data can be sourced either directly from individual GP systems, centrally from NHAIS systems that hold patient registration data for PCT GP Practices or from PDS, which as the national electronic database of NHS patient demographic details holds data such as name, address, date of birth and NHS Number.

Patient registration data from GP systems can be accessed using each system’s proprietary interface. As these interfaces differ across vendors’ systems, where a population needs to be defined based on data from several different types of GP system, the system integration effort to retrieve and collate data increases.

NHAIS has the advantage of providing the same interfaces to all GP patient registration data. Specifically NHAIS provides the Organisation Links interface and the Open Exeter Web Service interface that allow both the pushing and pulling of registration data. PDS also provides the opportunity of a single interface to all national GP patient registration data.

Within health care modelling, mining fields will often be sourced from many different delivery areas such as:

  • Inpatient (IP) encounters
  • Outpatient (OP) encounters
  • Accident and Emergency (AE) encounters
  • General Practice (GP) consultations
  • Prescription events

Data relating to these different types of encounter, for the population to be included as examples in the training dataset, will be held in multiple health care provider systems. The existence of a relationship between a member of the population and any provider is difficult to predetermine, except for their GP Practice. Therefore it is usually impractical to source data directly from providers, as this would entail exhaustively polling every possible provider system to see if it held any relevant data, with the consequent technical issues of discovering and integrating with those systems.

A better approach is to use centralised data sources which have collected data from individual providers. For IP, OP and AE data the Secondary Users Service (SUS) provides such a centralised data source.

For GP data there is currently no centralised data source, although the General Practice Extraction Service (GPES) may be able to provide this in the near future. Therefore GP data needs to be sourced from individual GP systems. This can either be achieved through using GP system proprietary interfaces or by using MIQUEST. MIQUEST provides a single interface to all GP systems that allows you to formulate data queries in a Health Query Language (HQL) to retrieve specific patient medical data.

For prescription data, health providers such as GP practices and hospital clinical systems will contain prescription information within the medical records for individuals.

For social care and community care data there is no centralised data source. The information systems for individual providers will need to be used. Currently there are no operational predictive models that use social care data.

Irrespective of which data sources are used it is recommended that a comprehensive data dictionary is constructed for each individual data source. This will be needed for subsequent processing of data in the model building process. A data dictionary will consist of a list of data dictionary items. Each item corresponds to an attribute and has the following properties:

  • Name – this is often just the name or identifier of the data item or field in the data source, this should be unique within the data dictionary for the data source.
  • Description – attribute names are sometimes cryptic, therefore it is a good idea to have a textual description of what the attribute represents.
  • Type – this defines the data type, for example numeric or alphanumeric.
  • Size – most data sources will set a maximum size for an attribute, for example a character string of up to 50 characters.
  • Missing Value – some data sources may be able to distinguish when a value for an attribute is missing for whatever reason, and record this fact. This is often done by representing the missing value with a specific value that the attribute could not normally take.
  • Unknown Value – some data sources may be able to record the fact that the attribute value is unknown. For example an attribute might represent “age of onset of condition”, which in a source system is elicited directly from a patient. The patient may not know, and the source system might record it as unknown. An unknown value is often represented by a specific value that the attribute could not normally take. However where an attribute is coded (see next property) the code set often includes a code for unknown.
  • Codes – some attributes represent code values. The associated code set should be documented. This normally consists of a description for every code value. However some code sets may be very large, for example SNOMED, in which case an explicit reference to the code set will be sufficient.

Note that most people do not normally maintain separate data dictionaries for each data source, but collapse them into one data dictionary with an additional Data Source property.

Data dictionaries can be collated using a range of software tools, from specialised data modelling software to a simple Excel spreadsheet, as shown below in figure 1.

Figure 1 – Data dictionary.
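As a minimal machine-readable sketch of such a data dictionary (the field names and example values are illustrative assumptions, not taken from any particular data source):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class DataDictionaryItem:
    """One attribute in a data source, following the properties listed above."""
    name: str                                  # unique within the data source
    description: str                           # what the attribute represents
    data_type: str                             # e.g. "numeric" or "alphanumeric"
    size: Optional[int] = None                 # maximum size, e.g. 50 characters
    missing_value: Optional[str] = None        # value used to flag a missing value
    unknown_value: Optional[str] = None        # value used to flag an unknown value
    codes: dict = field(default_factory=dict)  # code -> description, or a reference
    data_source: str = ""                      # for a single collapsed dictionary

# Illustrative entry for a hypothetical field in a GP extract
age_of_onset = DataDictionaryItem(
    name="AGE_ONSET",
    description="Age of onset of condition, as reported by the patient",
    data_type="numeric",
    size=3,
    missing_value="-1",
    unknown_value="999",
    data_source="GP extract",
)
```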

Data from a data source will have a data structure or data model. It is recommended that this is documented as it will also be needed for subsequent processing of data in the model building process. In the majority of cases it can be described in terms of the familiar Entity Relationship (ER) representation. An ER representation defines the data entities with their attributes and the relationships, roles and cardinalities between them.

ER representations can also be recorded using a range of software tools, from specialised data modelling software to a simple Visio diagram, as shown below in figure 2.

Figure 2 – ER diagram.

Data received from a data source will have a physical format. It is recommended that this is documented, as you need to understand the format to be able to physically load the data into your data storage. This format could be for example:

  • XML file or message
  • CSV text file
  • EDI message

Data received from a data source will be transported using a specific mechanism. It is recommended that this is documented, as you need to understand how to physically move data from a data source system to your own data storage. Transport mechanisms could be for example:

  • FTP
  • Email
  • Web services
  • Physical media CD/DVD

Consideration should obviously be given to access authentication and security of transport for all mechanisms.

Data Quality

The accuracy of a predictive model will be affected by the quality of the data taken from the data sources.

Data quality will be determined by the amount of missing values and by the amount of inaccurate or wrong values in the training dataset.

Most data mining analysis methods do not deal well with missing values; however, there are several common approaches used to eliminate missing values from the training dataset:

  • Ignore examples containing missing values
  • Fill the missing values manually
  • Substitute the missing values by a global constant or the mean of the examples
  • Fill in the missing values with the most probable value – normally called imputation

It is recommended that a quantitative assessment of the amount of missing data is carried out and some form of imputation is used if required.
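A minimal sketch of quantifying missing data and applying these approaches, assuming the training dataset is held as a single table in a pandas DataFrame (the file and column names are hypothetical):

```python
import pandas as pd

df = pd.read_csv("training_dataset.csv")   # assumed single-table training dataset

# Quantitative assessment: count missing values per attribute
missing_counts = df.isna().sum()
print(missing_counts[missing_counts > 0])

# Approach 1: ignore examples containing missing values
df_complete = df.dropna()

# Approach 3: substitute missing values with the mean of the examples
df_mean_filled = df.fillna({"age": df["age"].mean()})

# Approach 4: simple imputation - fill with the most probable (modal) value
df_imputed = df.fillna({"gender": df["gender"].mode().iloc[0]})
```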

Inaccurate or wrong attribute values are much more difficult to detect. It is not recommended that you assume that any data extracted from a data source is clean and error free (whatever the reassurances given by the source organisation). You should carry out your own due diligence to check the quality of the data you are going to use to create the training dataset.

All data items within a data source will have a set of associated validation and credibility checks. A validation rule defines constraints that values for a data item should meet. This will also include constraints between values for different data items. Examples are:

  • Value for data item X should be within a specific numeric range
  • Value for data item Y should be in a set of specific values (code set)
  • IF value for data item X defined THEN must be greater than value for data item Z

Where the same data item value is present in more than one data source, then you can also carry out cross validation across data sources. This acts as a measure of data accuracy.

Whereas a validation rule is either passed or failed, a credibility check indicates where a data item value might be in error and prompts further investigation. For example:

  • Value for data item X for most individuals will normally be within a specific range

Validation rules and credibility checks are often implicit; it is therefore recommended that they are explicitly documented.

Validation rules and credibility checks can be implemented in a variety of ways. However the most common approach is to load the raw data from the sources into staging tables within a database and then implement the validation rules and credibility checks in bespoke code. The code traverses the database tables running each database table record through the appropriate rules and checks. The detailed results as well as aggregate counts are written to some form of data quality report. The code can be implemented either within the database itself, as for example by stored procedures, or in an external application, for example a Java or .NET application.
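A sketch of this pattern is shown below; the staging table and column names, and the code sets, are assumptions, and a real implementation could equally live in stored procedures or a Java/.NET application:

```python
import sqlite3

conn = sqlite3.connect("staging.db")   # assumed staging database

failures = []
rows = conn.execute(
    "SELECT id, age, gender, admission_date, discharge_date FROM staging_inpatient"
)
for row_id, age, gender, admission_date, discharge_date in rows:
    # Validation rule: value for a data item should be within a specific numeric range
    if age is None or not (0 <= age <= 120):
        failures.append((row_id, "age out of range"))
    # Validation rule: value should be in a set of specific values (illustrative code set)
    if gender not in ("1", "2", "9"):
        failures.append((row_id, "gender not in code set"))
    # Validation rule: IF discharge date is defined THEN it must not precede admission
    if discharge_date is not None and discharge_date < admission_date:
        failures.append((row_id, "discharge before admission"))

# Detailed results plus aggregate counts go to the data quality report
print(f"{len(failures)} validation rule failures")
```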

Any identified errors in the data can be addressed in a similar manner to that of missing values as outlined above. This process is often called data cleaning. As for data quality checking this is most often implemented as bespoke code running against staging tables in a database.

Data Sharing

Using data sources to create a training dataset will involve sharing data between organisations within and across health and social care. Such data sharing will require careful consideration of Information Governance (IG) obligations, constraints and controls. Key to these issues is the classification of data as either patient identifiable data (PID) or anonymous patient-based data, as the former requires considerably tighter IG controls.

GP systems use the Read coding system as a standard terminology for describing the care and treatment of patients. Included in the Read coding system are specific codes to record a patient’s consent to share data in different contexts, as shown in table 1.

Table 1 – Read codes for sharing consent.

Some contexts are precisely defined. Other contexts are more ambiguous and not specific to the data sharing requirements of a particular model building exercise or subsequent data marshalling process. Therefore it is recommended that where you do use Read consent codes as a means of checking consent to share at a patient level, there is a common understanding in place across data providers, data processors and data consumers as to what consent these codes are being used to represent. You should also be aware that some data extraction mechanisms may be cognisant of these codes and subsequently block extraction of non-consenting patient records.

Data Linkage

The data taken from different data sources will relate to specific individuals. To construct the training dataset these data need to be linked together so that all data items that belong to a single individual are identifiable. In the simplest case this is achieved by constructing a training dataset that consists of a single table where each row represents an individual (example) and each column a data item (mining attribute). All data mining products will be able to analyse training datasets in this structure, while some products can also analyse training datasets in a relational format where multiple tables are linked by foreign keys. However it is recommended that you try and construct a single table training dataset to ensure you do not restrict your choice of data mining products.

Within health care, an individual is normally uniquely identified by their NHS number. Most health care records will be indexed by NHS number; therefore it is recommended this is used for data linkage. However not all health data may have an associated NHS number and even when present it may be incorrect. NHS numbers can be verified or unverified, where the former has been checked against PDS to ensure the NHS number is associated with the correct individual based on demographic data. Some health records will have the verification status included.

The use of the NHS number in social care is currently not widespread; therefore linkage of data within and across social care data usually relies on matching based on a minimum of the following (see the sketch after this list):

  • Surname/Last Name
  • Date of Birth
  • Gender
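A minimal sketch of building a deterministic match key from these items; the normalisation rules are illustrative assumptions, and real matching often needs more sophisticated handling of name variants:

```python
def match_key(surname: str, date_of_birth: str, gender: str) -> str:
    """Deterministic key from surname, date of birth (ISO format) and gender."""
    cleaned_surname = (
        surname.strip().upper().replace("-", "").replace("'", "").replace(" ", "")
    )
    return "|".join([cleaned_surname, date_of_birth.strip(), gender.strip().upper()[:1]])

# Records from two sources are treated as the same individual when their keys are equal
assert match_key("O'Brien", "1947-03-21", "Female") == match_key("OBRIEN", "1947-03-21", "F")
```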

Data sharing agreements may place restrictions on access to personal identifiers within the data provided by data sources. At one extreme is the provision of fully anonymised data which contains no data items that can be used to identify an individual. For data linkage such data is unusable as there is no method of linking data across multiple data sources. An intermediate position of pseudonymised data is where the personal identifiers have been removed and replaced with a unique identifier. The unique identifier can be linked back to the personal information via a key, which is held securely and separately from the pseudonymised data. There are basically three approaches to pseudonymisation:

  1. A data source implements a local algorithm to apply pseudonymisation. Other data sources may implement their own different algorithm.
  2. A data source implements a common algorithm to apply pseudonymisation. Other data sources implement the same algorithm.
  3. A data source sends identifiable data to a central secure safe haven with clear identifiers using secure communications and pseudonymisation is applied centrally.

Approach 1 is impractical for data linkage as pseudonymised data from multiple sources which have used multiple algorithms will have different unique identifiers for the same individual. Approach 2 is practical but requires all data sources to implement the same algorithm; this may not be technically or commercially feasible. Approach 3 is practical but requires investment in the provisioning of such a central service.
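As a sketch of approach 2, each data source could apply the same keyed-hash algorithm to the NHS number, so that the same individual receives the same pseudonym at every source; the shared secret key, and how it is distributed and governed, are assumptions that would need careful IG consideration:

```python
import hashlib
import hmac

SHARED_SECRET = b"replace-with-a-securely-distributed-key"   # same key at every source

def pseudonymise(nhs_number: str) -> str:
    """Replace an NHS number with a repeatable, non-reversible pseudonym."""
    return hmac.new(SHARED_SECRET, nhs_number.encode("utf-8"), hashlib.sha256).hexdigest()

# The same individual gets the same pseudonym at every data source,
# so records can still be linked without exposing the NHS number itself.
print(pseudonymise("9999999999"))
```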

It is recommended that the method of matching data is explicitly defined and documented to minimise the risk of mismatched data. If some data is mistakenly linked the training dataset will contain invalid associations of data values which will affect the external validity of any generated predictive model.

When linking data some data records may not be matched and, depending on the context of the data sources involved, this can either be ignored or recorded as a data quality issue. Within the documented data model one entity is usually taken to represent the baseline population (for example the list of registered patients with a GP Practice). Data linkage normally works out from this baseline population across the other data records from the different data sources. A data source (for example inpatient data) may or may not have any data that relates to an individual in the baseline population, so not matching any data for a specific individual is not an issue. The same data source is also likely to hold a large amount of data relating to individuals not in the baseline population, which again is not an issue. However a data source may be expected to have a one to one relationship with the baseline population, and if no data relating to an individual in the baseline population is found this should be regarded as a data quality issue.

It is recommended that a quantitative assessment of the amount of missing data is carried out during linkage. As discussed previously in data quality, there are various approaches to dealing with missing data. However in data linkage complete data records may be found to be missing not just single data item values, so the recommended approach is to ignore examples with missing linkage data.

Data linkage can be implemented in a variety of ways. However the most common approach is to build on the staging tables within a database approach used for data quality checking. The data linkage logic is implemented in bespoke code. The code traverses the staging database table containing the base population. For each record it finds, it tries to link it with the appropriate records in the other staging tables. If a single table training dataset approach (as recommended) is being followed, then this linked set of records is written out as a single flattened record to a database table that will hold the training dataset. The detailed results of the linkage as well as aggregate counts are written to some form of data quality report. The code can be implemented either within the database itself, for example as stored procedures, or in an external application, for example a Java or .NET application.
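A sketch of this flattening step under the recommended single-table approach, linking a baseline population staging table to an inpatient staging table on NHS number (the table and column names are assumptions):

```python
import sqlite3

conn = sqlite3.connect("staging.db")

# Work outwards from the baseline population; a left join keeps individuals with
# no inpatient activity, and each individual becomes one flattened record.
conn.execute("""
    CREATE TABLE training_dataset AS
    SELECT p.nhs_number,
           p.age,
           p.gender,
           COUNT(ip.spell_id)                       AS inpatient_spells,
           COALESCE(MAX(ip.emergency_flag), 0)      AS had_emergency_admission
    FROM baseline_population p
    LEFT JOIN staging_inpatient ip ON ip.nhs_number = p.nhs_number
    GROUP BY p.nhs_number, p.age, p.gender
""")
conn.commit()

# Aggregate linkage counts for the data quality report
unmatched = conn.execute("""
    SELECT COUNT(*) FROM staging_inpatient ip
    WHERE NOT EXISTS (SELECT 1 FROM baseline_population p
                      WHERE p.nhs_number = ip.nhs_number)
""").fetchone()[0]
print(f"{unmatched} inpatient records relate to individuals outside the baseline population")
```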

Data Storage

Data that represents raw data from different data sources, cleaned data and the resultant training dataset needs to be stored and managed. Storage can be implemented as flat files or within a Relational Database Management System (RDBMS). As raw data will often be supplied in different physical formats and both data quality checking and data linkage processing require flexible and uniform access to data, it is recommended that a RDBMS is used for data storage and management.

There are many commercial RDBMS (for example Microsoft SQL Server and Oracle Database) and open source RDBMS (for example MySQL and FirebirdSQL) that are suitable for predictive model building.

As preparing the training dataset is a process that is not time or resource constrained, data storage need not have high availability, high reliability or good performance. So for example a RDBMS running on a desktop computer may well be sufficient. However security and backup/recovery should be addressed. If the data contains personal identifiers then normal good practice security measures should be applied; for example password-protect access to the computer, set user logins to the RDBMS, set user roles in the RDBMS, and set access control on the database within the RDBMS. Although availability and reliability are not important for data storage when model building, it is important to protect your data from accidents and failures. If you have spent many days carefully checking, cleaning and linking data into a training dataset, you will not want to lose all that work through a hard disk failure, so back up the data storage on an appropriate schedule. Also remember to test your recovery process.

Consideration should be given to the storage volumes required, which will be determined by the scope of the data being used to build the model. Even reasonably large data volumes of around, say, 1 terabyte can be comfortably accommodated on a modern desktop computer without the need to resort to specialised storage solutions such as Network Attached Storage (NAS) or Storage Area Networks (SAN). If storage space is an issue, then most operating systems and RDBMS support data compression. This will reduce the physical hard disk space required but will introduce a performance penalty; however, performance is not important in the model building process.

Other issues to consider when deciding on an appropriate RDBMS to use are:

  • Import facilities – most raw data will be supplied in flat file format and will need to be loaded into staging tables within a database.
  • Export facilities – the resultant training dataset is often exported to a simple text file for processing by the data mining software. You may also want to be able to distribute your training dataset to other model builders so they can test their own models against your data (see the Model External Validity section later on).
  • Programming facilities – all RDBMS will support SQL and most provide means of packaging SQL into routines often called stored procedures.
  • Application Programming Interfaces (API) – different programming languages and runtime systems support different database connection APIs, for example ODBC, OLE-DB, .NET Data Providers and JDBC.

Model Creation

We have discussed how a training dataset can be created, which will then be processed by data mining software to create a predictive model. The process to create a training dataset is summarised in figure 3.

Figure 3 – Creating a training dataset.

There are a wide range of different types of predictive models. These include:

  • Association Rules
  • Baseline Models
  • Cluster Models
  • General Regression
  • K-Nearest Neighbours
  • Naïve Bayes
  • Neural Network
  • Regression
  • Ruleset
  • Scorecard
  • Sequences
  • Text Models
  • Time Series
  • Trees
  • Support Vector Machine

For each type of model the method of analysing the training dataset is different and the structure of the model produced is also different.

A logistic regression model (a Regression model type) is an example of a model type that has been commonly used in health care. The PARR (Patients At Risk of Re-admission) family of risk stratification tools and the more recent Combined Predictive Model (CPM) are examples of logistic regression models.

A logistic regression model is structured as a regression equation of the form:

Log odds = Intercept + Variable1 * Beta1 + Variable2 * Beta2 + … + VariableX * BetaX
Probability = 1 / (1 + exp(-Log odds))

Where Variable represents the value of a predictor attribute (for example gender), Beta is the coefficient value to multiply it by (for example 0.3456), and Intercept is a constant. By using the model’s Beta and Intercept values with the Variable values for a specific individual, the probability of the outcome for that individual can be calculated.
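For example, a minimal sketch of applying the equation with made-up intercept and coefficient values (not those of any published model):

```python
import math

# Hypothetical model output from the data mining software
intercept = -3.2
betas = {"age": 0.03, "male": 0.3456, "emergency_admissions_last_12m": 0.9}

# Predictor values for one individual
individual = {"age": 72, "male": 1, "emergency_admissions_last_12m": 2}

log_odds = intercept + sum(betas[name] * value for name, value in individual.items())
probability = 1 / (1 + math.exp(-log_odds))
print(f"Predicted probability of the outcome: {probability:.3f}")   # about 0.75 here
```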

There are many statistical debates about the strengths and weaknesses of different types of predictive model which are outside the scope of this article. There is also some debate in the health care domain focused on the explanatory power of different model types. As stated previously, most of the health care predictive models are regression models. This can in part be explained by the fact that most model builders working in this area come from a medical statistics background and are more familiar with statistical modelling than with the more generic data mining. This does however lead to some confusion, as regression analysis is also used to determine statistically significant relationships between different variables – which is then often used to infer causal relationships between variables. For predictive modelling, however, all we are concerned about is the ability of a model to predict accurately, not how it achieves it (apart from its internal validity as discussed later). For some model types, for example Neural Networks, it is much harder to understand how they work than for, say, regression equations. This is sometimes used as an argument to criticise such model types for lack of explanatory power.

There is a wide range of data mining products available that can be used to build predictive models, some of which are open source software. Table 2 below gives a representative list of companies/projects that offer data mining software.

Table 2 – Data mining products.

Choice of a suitable data mining product will be based on a variety of product features and local factors such as:

  • The range of model types a product supports
  • The range of training dataset input methods
  • Authorised and supported products within your organisation
  • Product licensing and support costs
  • Existing product experience and skills within your organisation

A data mining product will process the training dataset and, based on the type of model selected as the target model, will analyse the data to (see the sketch after this list):

  • Select which attributes best predict the outcome(s) – the predictors
  • Calculate the algorithm that should be applied to the predictors – for example in a regression model this will be the coefficients and intercept
  • Output measures of the validity of the model
  • Output the calculated predictive model
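As an illustration of these steps using the open-source scikit-learn library (one possible product choice; the file and column names are assumptions and all mining fields are assumed to be numeric):

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("training_dataset.csv")
outcome = df["emergency_admission_next_12m"]                       # outcome attribute
candidates = df.drop(columns=["nhs_number", "emergency_admission_next_12m"])

# Select which attributes best predict the outcome (the predictors)
selector = SelectKBest(f_classif, k=10).fit(candidates, outcome)
predictors = candidates.loc[:, selector.get_support()]

# Calculate the algorithm applied to the predictors (coefficients and intercept)
model = LogisticRegression(max_iter=1000).fit(predictors, outcome)
print(model.intercept_[0], dict(zip(predictors.columns, model.coef_[0])))
```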

Although the model building process has been presented as a linear one of creating the training dataset followed by model creation, in practice it tends to be more iterative – based on the quality of the model you may decide to revisit the creation of the training dataset, for example to try alternative data sources.

Model Internal Validity

Data mining products when calculating a predictive model will also generate a set of measures to allow you to assess the internal validity of the model. Internal validity refers to the internal consistency and conformance to modelling assumptions of the generated model. Internal validity does not tell you anything about how accurate or useful the model is – this is termed external validity.

A model that is not internally valid should not be used, irrespective of its external validity, as the model is not intrinsically sound.

The relevant measures of internal validity depend on the type of model created and are outside the scope of this article. However, examples of checks for regression models include:

  • Multivariate distribution of data (normal distribution)
  • Homoscedasticity
  • No autocorrelation
  • Relevant sample size and power
  • Collinearity
  • No multilevel modelling effects

As can be seen from the above examples, professional statistical and/or data mining advice is often needed to verify the internal validity of a model.
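As one illustrative check, collinearity between predictors can be screened with variance inflation factors computed directly from the predictor matrix; this sketch is no substitute for professional statistical advice:

```python
import numpy as np

def variance_inflation_factors(X: np.ndarray) -> list:
    """VIF for each column of the predictor matrix X (rows are examples).
    Values well above 5-10 are commonly taken to suggest problematic collinearity."""
    n, p = X.shape
    vifs = []
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])       # regress column j on the rest
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        residuals = y - A @ coef
        r_squared = 1.0 - residuals.var() / y.var()     # assumes column j is not constant
        vifs.append(1.0 / max(1.0 - r_squared, 1e-12))  # guard against a perfect fit
    return vifs
```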

Model External Validity

Model external validity refers to measuring the predictive accuracy of the created model. Most data mining products can measure this in a variety of ways. The simplest, and the one that seems to be most often used in health care modelling, is to take a portion of the training dataset as a test dataset. For example, it is common practice to collate a training dataset of four years of data for building a health care predictive model. The last year of data is then set aside, or held out, as the test dataset. This test dataset is then run through the model and the predicted outcomes compared with the actual outcomes recorded in the data. This is referred to as a hold-out estimate of predictive accuracy.

There are more elaborate methods of measuring predictive accuracy based on re-sampling and cross-validation that are outside of the scope of this article.

Accepted measures of accuracy include:

  • R-squared
  • Positive predictive value
  • Sensitivity
  • Specificity and negative predictive value
  • Various measures based on the Receiver Operating Characteristics (ROC) curve

For a description of these measures see Lewis, Curry and Bardsley (2011).
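A sketch of computing several of these measures for a hold-out test dataset is shown below, using scikit-learn's metrics functions (one possible choice; most data mining products report equivalent figures):

```python
from sklearn.metrics import confusion_matrix, roc_auc_score

# Actual outcomes from the held-out year and the model's predicted probabilities
y_true = [0, 0, 1, 1, 0, 1, 0, 0]
y_prob = [0.10, 0.40, 0.80, 0.30, 0.20, 0.90, 0.60, 0.10]
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]        # illustrative 0.5 threshold

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
ppv = tp / (tp + fp)                                   # positive predictive value
npv = tn / (tn + fn)                                   # negative predictive value
auc = roc_auc_score(y_true, y_prob)                    # area under the ROC curve
print(sensitivity, specificity, ppv, npv, auc)
```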

These measures are often used to compare the accuracy of different models. However, remember that such comparisons are only valid if the models are predicting the same outcome(s).

As a model builder it is also recommended that you consider making your training dataset and test dataset available to other model builders. They can then measure the predictive accuracy of their model directly against your data. Consideration must obviously be given to appropriate anonymisation of the data and its structure/format. As recommended previously, the training dataset is best created as a single table as this will aid portability between different data mining products.

Model Representation

Most data mining products will output a created model in a proprietary format. This may range from a proprietary binary file format to a simple on-screen textual description. Such proprietary formats of model representation make it difficult to describe and understand models in a consistent way, and inhibit the sharing of model descriptions and subsequent interoperability.

What is needed is a standardised model representation format that allows the unambiguous description of all model types that can be read by both people and systems.

Fortunately such a standard exists in the data mining community and is called Predictive Model Markup Language (PMML).

PMML is an open standard maintained by the Data Mining Group (see http://www.dmg.org/). PMML uses XML to represent a wide range of model concepts as shown in figure 4.

Figure 4 – PMML.

It supports the full range of model types previously outlined, including regression, neural networks and decision trees.

A software product can support PMML as a producer and/or a consumer. Most of the data mining products listed previously support PMML as a producer, so created predictive models can be saved in PMML format.

It is recommended that as a model builder you represent your predictive model in PMML format. Note some commercial predictive models are developed and then sold as black box predictive tools, where customers and users are restricted from knowing how the underlying model works. In these situations vendors are unwilling to share a model representation in PMML or any other format.

Although PMML can fully represent a predictive model it is recommended that as a model builder you also provide a data dictionary document describing the predictors and outcomes to aid model users in marshalling data for input to any prediction tool that uses the model.

As a working example of PMML, the CPM logistic regression equation has been taken and turned into a PMML representation.

The PMML file begins with a standard header as shown in Figure 5.

Figure 5 – PMML header.

It is then followed by a data dictionary which defines the variable types that can be used in the model. Note that a new CPM variable called TXID has been added. This is used to represent a Transaction ID to uniquely identify an individual when creating a prediction. This can be set to any value by the system passing data to a prediction tool and will be returned along with the outcome probability. When anonymity is required the TXID can be generated by the client system of the prediction tool. The client system retains a mapping to an individual identifier such as NHS Number, so that when it receives an outcome probability along with the TXID it can relate it back to the specific individual.

Figure 6 – PMML data dictionary.

The data dictionary is followed by the model definition. For CPM this is a “logisticRegression” model type. The first part of the definition declares the mining schema, which identifies which variables in the data dictionary are used in the model and what role they play.

Figure 7 – PMML regression model.

This is then followed by the declaration of the regression intercept and variable coefficients.

Figure 8 – PMML regression table.
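To show the overall shape of such a file end to end, the sketch below embeds a minimal, hypothetical PMML logistic regression model (illustrative variables and coefficients, not the actual CPM definition) and evaluates it by reading the intercept and coefficients back out of the XML:

```python
import math
import xml.etree.ElementTree as ET

# Minimal, hypothetical PMML regression model (not the actual CPM)
PMML = """<PMML version="4.1" xmlns="http://www.dmg.org/PMML-4_1">
  <Header copyright="Example" description="Illustrative logistic regression model"/>
  <DataDictionary numberOfFields="3">
    <DataField name="AGE" optype="continuous" dataType="double"/>
    <DataField name="PRIOR_ADMISSIONS" optype="continuous" dataType="double"/>
    <DataField name="OUTCOME" optype="categorical" dataType="string"/>
  </DataDictionary>
  <RegressionModel functionName="classification" normalizationMethod="logit">
    <MiningSchema>
      <MiningField name="AGE"/>
      <MiningField name="PRIOR_ADMISSIONS"/>
      <MiningField name="OUTCOME" usageType="predicted"/>
    </MiningSchema>
    <RegressionTable intercept="-3.2" targetCategory="1">
      <NumericPredictor name="AGE" coefficient="0.03"/>
      <NumericPredictor name="PRIOR_ADMISSIONS" coefficient="0.9"/>
    </RegressionTable>
    <RegressionTable intercept="0" targetCategory="0"/>
  </RegressionModel>
</PMML>"""

ns = {"pmml": "http://www.dmg.org/PMML-4_1"}
table = ET.fromstring(PMML).find(".//pmml:RegressionTable[@targetCategory='1']", ns)
intercept = float(table.get("intercept"))
coefficients = {p.get("name"): float(p.get("coefficient"))
                for p in table.findall("pmml:NumericPredictor", ns)}

# Score one individual using the logistic regression equation given earlier
individual = {"AGE": 72, "PRIOR_ADMISSIONS": 2}
log_odds = intercept + sum(coefficients[k] * v for k, v in individual.items())
print("Probability:", 1 / (1 + math.exp(-log_odds)))
```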

The utility of using PMML is that it is now possible to fully decouple the model producer from the model consumer (a prediction tool). As an example, the company Zementis provides a product called ADAPA, which is a standards-based (PMML), real-time scoring engine (prediction tool) accessible on the Amazon Cloud as a service.

To use the service you need to be registered (a free trial registration is available). You then log in to the ADAPA console.

Having logged in you can upload a predictive model definition in PMML format, in our case the CPM definition. ADAPA validates the PMML file and then dynamically builds a predictive tool instance based on the model definition.

ADAPA then allows you to make predictions by either calling a set of web services or just by uploading a simple CSV text file that contains the predictor values for a set of cases, one case per line.

In ADAPA you select to score or predict data. You then select the PMML model to use (CPM in our case which we have just uploaded) and upload the CSV data file to run through the model.

You can then view the outcome predictions and/or download the results as a text file.

Model Maintenance

A model is generated from a training dataset. A training dataset represents a baseline population over a specific time span. The characteristics of the baseline population will naturally change over time and, as pointed out earlier, the successful application of predictions made by the model as interventions will also effect change within the baseline population. Both these factors mean that the external validity of a model may well change over time and thus a model should undergo regular maintenance.

Model maintenance consists of first reassessing the external validity of the predictive model at regular intervals. The interval will be determined by the anticipated rate of change in the baseline population and will typically be of the order of several years. Some model builders in the health care domain take three years as a rule of thumb.

The reassessment of external validity is normally carried out by building a retrospective test dataset for the same base population and from the same data sources that were used to create the initial training dataset. Data covering the time span from the last reassessment (or initial model creation) to the end of the period is used. As in the training dataset, this will contain values for both predictors and outcomes. The process of creating this test dataset follows that already outlined for creating a training dataset.

The same measures of external validity should be applied to the test dataset as were originally applied when creating the model. The values of the current and previous measures should be compared to determine if there is any trend in changes of external validity. The new values should be used to decide if the model is no longer valid (accurate enough) and needs re-generation.

If the model is no longer valid it needs to be re-generated. This can be achieved by using the test dataset as a new training dataset and building a new model using the data mining product as previously described.

It is important to note that this new model may be significantly different from the previous model in that the new model may have different predictors. For example re-generating a logistic regression predictive model may not just create a model with different valued intercept and coefficients, it may drop predictors used in the previous model and introduce new ones.

The implication is that model maintenance can have a profound impact on the operationalisation of a predictive model in the form of predictive tooling. If an in-use model changes, both the predictive tool that uses the model and possibly the data marshalling process used to collate input data for the tool will need to be changed. Where predictive tooling and data marshalling are technically and/or commercially constrained, any changes can be difficult and/or costly.

Model maintenance should be based on model re-generation to ensure model internal validity is met. Model maintenance that merely adjusts model parameters through some form of parameter modelling, often called recalibration, should be avoided as the internal validity of the adjusted model cannot be assessed.