Pseudonymisation

By Health & Social Care Information Centre | 10 April 2013

Introduction

Pseudonymisation is used to mask the identity of patients when their data is shared with third parties for things like studies and trials. In essence, it replaces primary keys in the data, such as NHS numbers and Post Codes, with replacements that maintain uniqueness but prevent identification. It’s not perfect, and never can be: if a group of individuals has a rare disease then it may still be possible to guess who one of them is. In particular, dates of birth are difficult to pseudonymise because of their very limited range, and Post Codes are often too specific. Both are often best replaced with aggregation at a higher level, such as month and year or postal town.
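To make the aggregation idea concrete, here is a minimal sketch in plain Java. It is not part of the client: the class and method names are made up for illustration, and the outward code of a Post Code (the part before the space) stands in for "postal town", since mapping to a real town name would need lookup data.

```java
import java.time.LocalDate;
import java.time.YearMonth;
import java.util.Locale;

// Hypothetical helper, not part of the HDN libraries.
class GeneralisationSketch
{
    // Keep only the year and month of a date of birth.
    static YearMonth generaliseDateOfBirth(final LocalDate dateOfBirth)
    {
        return YearMonth.from(dateOfBirth);
    }

    // Keep only the outward code of a UK Post Code.
    // Relies on the inward code (the part after the space) always being 3 characters long.
    static String generalisePostCode(final String postCode)
    {
        final String compact = postCode.replace(" ", "").toUpperCase(Locale.ROOT);
        if (compact.length() < 5)
        {
            throw new IllegalArgumentException("Too short to be a Post Code: " + postCode);
        }
        return compact.substring(0, compact.length() - 3);
    }

    public static void main(final String[] args)
    {
        System.out.println(generaliseDateOfBirth(LocalDate.of(1975, 8, 23)));
        System.out.println(generalisePostCode("LS1 1XX"));
    }
}
```

Whether month and year (or the outward code) is coarse enough depends on the size of the population in the data set; for small groups even these can be identifying.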

There are a number of different ways to pseudonymise data. We’ve written a client to demonstrate how a number of them from the PIP project can be implemented in Java.

The client can be used either as a standalone program (in which case it produces results in TSV format for ease of use with standard POSIX tools) or as a Java library (jar). To make things as easy as possible, we’ve also written some wrappers for POSIX and Debian/Ubuntu. Your choices, in order of decreasing convenience, are:

  • Debian/Ubuntu deb packages, hosted in our apt repository at http://services.developer-test.nhs.uk/repositories/apt/hdn/
  • A tar ball, which contains a complete file system to untar over your root /. This should work on any POSIX system, including Mac OS X and Cygwin.
  • A standalone java jar, with all dependencies included, suitable for execution or as a library
  • A set of java code libraries with source
  • Forking from github

Pseudonymise data

The best way to get going is to use the command line. We’ll look later at how to create requests programmatically using the Java library.

Using the command line client to pseudonymise data

The way you do this varies depending on what you used above:

  • If you’ve installed the deb package or the tar ball, you’ll have the program hdn-pseudonymisation-client on your PATH. To use it, open a terminal console and type hdn-pseudonymisation-client. It takes standard POSIX options.
  • If you’ve installed the standalone jar file, you’ll need to run commands from the folder you downloaded the file to. Open a terminal console and change to that folder. Type java -jar hdn-pseudonymisation-client.jar. It takes the same standard POSIX options as the program above. For the rest of this document, wherever you see hdn-pseudonymisation-client … you can substitute java -jar hdn-pseudonymisation-client.jar …
  • If you’ve downloaded or forked source from github, you can use IntelliJ to run the main class. Open source/subprojects.ipr and run the main class uk.nhs.hdn.pseudonymisation.client.HdnPseudonymisationClientConsoleEntryPoint. There are already some sample configurations set up for you to debug in IntelliJ. If you don’t have or use IntelliJ (and you really should) then you can open the source in Eclipse or NetBeans. You’ll need to add the libraries ‘annotations’ (library/annotations/VERSION/annotations.jar) and ‘jopt-simple’ (library/jopt-simple/VERSION/jopt-simple-VERSION.jar) to the class path.

Checking everything’s OK

Before we get going, let’s check that everything works as expected. Run the command hdn-pseudonymisation-client --help (remember to substitute java -jar hdn-pseudonymisation-client.jar if you need to). You should see a list of supported options. At the time of writing, it looks like this:

Option                                  Description                           
------                                  -----------                           
--data-kind <DataKind: Post Code or     One of post_code or nhs_number        
  NHS Number data is being                (default: nhs_number)               
  pseudeonymsied>                                                             
--file <Path to data-kind file (one     Path to file of data-kind (default: -)
  per line, LF separated), or - to use                                        
  standard in>                                                                
--help                                  Displays help for options             
--pseudonymsiers                        One of Signed32BitSequence,           
  <PseudonymisationAction: one or more    Signed64BitSequence, UUID,          
  pseudonymisation operations>            Random8Bytes, SHA512,               
                                          SHA512First4Bytes,                  
                                          SHA512First5Bytes, SHA512First6Bytes
                                          or SHA512First8Bytes (default:      
                                          [Signed32BitSequence,               
                                          Signed64BitSequence, UUID,          
                                          Random8Bytes, SHA512,               
                                          SHA512First5Bytes,                  
                                          SHA512First6Bytes,                  
                                          SHA512First8Bytes])                 
--version                               Displays version

If the output seems a bit compressed, it’s because we’re formatting for a 40 character wide screen – useful if you’re running this over ssh on Android. Whilst you can’t see it above, help output always produces an exit code of 2.

Since the options are regular POSIX long options (and are named similarly to those in the GNU coding standards), we can abbreviate them. Hence hdn-pseudonymisation-client -h and hdn-pseudonymisation-client --he will produce the same output. The only time you can’t do this is if the abbreviation would be ambiguous.

Let’s try out one of those options: --version.
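If you’re wondering how abbreviation and ambiguity interact, the rule is roughly "a prefix is accepted when exactly one option starts with it". The sketch below illustrates that rule in plain Java; it is not the option parser the client actually uses, and the class and method names are invented for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of unambiguous-prefix matching; not the client's parser.
class OptionAbbreviationSketch
{
    // Resolves what the user typed (without the leading hyphens) to a full option name.
    static String resolve(final String typed, final List<String> knownOptions)
    {
        // An exact match always wins, even if it is also a prefix of another option.
        if (knownOptions.contains(typed))
        {
            return typed;
        }
        final List<String> matches = new ArrayList<>();
        for (final String option : knownOptions)
        {
            if (option.startsWith(typed))
            {
                matches.add(option);
            }
        }
        if (matches.size() == 1)
        {
            return matches.get(0);
        }
        throw new IllegalArgumentException(matches.isEmpty()
            ? "Unknown option --" + typed
            : "Ambiguous option --" + typed + " could be any of " + matches);
    }

    public static void main(final String[] args)
    {
        final List<String> options = Arrays.asList("data-kind", "file", "help", "pseudonymsiers", "version");
        System.out.println(resolve("he", options));
        System.out.println(resolve("v", options));
    }
}
```

With the five options listed in the help output, every option starts with a different letter, so any prefix is currently unambiguous. That could change if options are added later, so scripts are safer spelling options out in full.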

Checking the version installed

Let’s run hdn-pseudonymisation-client --version:

hdn-pseudonymisation-client 2013.03.13.1202-development
© Crown Copyright 2013

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

Written by Raphael Cohn (raphael.cohn@stormmq.com)

Standard GNU-like stuff. It’s worth understanding the version number, in this case, 2013.03.13.1202-development. The part before the hyphen is the timestamp of the last git check in used to build the binary – you should be able to find it using git log. Additionally, this should match the version of the deb package. The part after is the git branch the code was built from. Usually this will be either development or master.

If instead it says unknown version then it means you’re using code that you’ve compiled yourself or that wasn’t released ‘officially’.
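If you want to check the version from a script or a test, splitting the string at the first hyphen is enough. The following is a small sketch under that assumption; the class name is made up, and it deliberately doesn’t try to interpret the digits of the timestamp.

```java
// Hypothetical helper for picking apart a version string such as "2013.03.13.1202-development".
class VersionSketch
{
    // Returns a two element array: { timestamp, branch }.
    // Splits at the FIRST hyphen only, so a branch name that itself contains hyphens stays whole.
    static String[] split(final String version)
    {
        final int hyphen = version.indexOf('-');
        if (hyphen < 0)
        {
            throw new IllegalArgumentException("No branch in version: " + version);
        }
        return new String[] {version.substring(0, hyphen), version.substring(hyphen + 1)};
    }

    public static void main(final String[] args)
    {
        final String[] parts = split("2013.03.13.1202-development");
        System.out.println("timestamp = " + parts[0]);
        System.out.println("branch    = " + parts[1]);
    }
}
```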

Pseudonymising Data

The client reads rows of data, by default from standard in (stdin), and then applies your chosen pseudonymisations. The following assumes a basic POSIX command line, like bash. So, let’s try pseudonymising an NHS number:

echo "1234567881" | hdn-pseudonymisation-client

This produces two rows of data to standard out (stdout):

nhs_number  Signed32BitSequencePseudonymiser(0) Signed64BitSequencePseudonymiser(0) UuidPseudonymiser() JavaNoQuiteSoSecureRandomNumberGeneratorPseudonymiser() HashPseudonymiser(64, SHA-512)  HashPseudonymiser(64, SHA-512) Salt HashPseudonymiser(4, SHA-512)   HashPseudonymiser(4, SHA-512) Salt  HashPseudonymiser(5, SHA-512)   HashPseudonymiser(5, SHA-512) Salt  HashPseudonymiser(6, SHA-512)   HashPseudonymiser(6, SHA-512) Salt  HashPseudonymiser(7, SHA-512)   HashPseudonymiser(7, SHA-512) Salt
1234567881  0x00000000  0x0000000000000000  0x3D3AC21623AA4F15B31303B95EF2EEB7  0xB934FDA023FD341D  0x8482BD431C5ED552E19B94361E6F68F5EBAC2CC093A595C3DD9B6C3485F021D1B76E244C5C2DCD1871710C6D58DAF8EE25A5C582FDFC01E41B03B9EC4B02D306  0x2E76736FB36F050BC3B0924FF1546081DC32744583E3B5807D63C97D672D6CCAED86C464A65668D68A31A40C1EBF2EE5A4C51E7D20226CB9CFA58AB3C6BD19E3  0x6B7FD4FA  0x71CD1DE521D6675DE4742D4E2D2C5B14F0A116CC4A801BD3016B89E45AC7DFDBD424112982A240C21ECBDB40EFE18ED6C92702AC9D62AE732CB76E12C86CE7E0  0x5D33A07630    0xA01F846BC6EF7A7A0519033AD83832C84D65FD648E7D92DC53C67D47076B22838817ABE2E5773D58B21AE9D5411DF5193D61E52543CD16558076F36CD3159DA2  0x4B31CF22C03B  0xCC29480F4C0C4F40B033A0F223FB5BEDDFE6AF08AA999DC756BD261726517A08233A9C6E79870042C55555B521B18A7F234CD8FD443AB91643558F46D3DEEC0B  0x15D917830CA605    0x1C1E4FDE0B17EED4FBE410BD52B934BFF1916A33450A064DD25E19CE3367453DD7E083C09925B581A4DB8A87B14B78D04C2D52598F7BFB109CFE558B8173D3EC

That looks like a lot, but it isn’t. It’s output as tab separated value (TSV) data. The first line is a header line. There’s then one or more rows of data. The first column is the original NHS number. If you try this example you’ll get different results for some columns: these are the ones whose output depends on a CSPRNG to do its work. Pseudonymised values are reported as upper case hexadecimal.
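To give a feel for what a salted, truncated hash column involves, here is a minimal sketch using only the JDK. It is an illustration, not the client’s implementation: the order in which salt and value are fed to the digest, the UTF-8 encoding of the value and the class and method names are all assumptions.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.security.SecureRandom;
import java.util.Arrays;

// Hypothetical sketch of a salted, truncated SHA-512 pseudonym; not the library's HashPseudonymiser.
class SaltedHashSketch
{
    private static final SecureRandom RANDOM = new SecureRandom();

    // A fresh 64 byte random salt, which is why results differ from run to run.
    static byte[] newSalt()
    {
        final byte[] salt = new byte[64];
        RANDOM.nextBytes(salt);
        return salt;
    }

    // Hashes salt followed by value (an assumed ordering) and keeps the first bytesToKeep bytes.
    static byte[] hash(final String value, final byte[] salt, final int bytesToKeep)
    {
        if (bytesToKeep < 1 || bytesToKeep > 64)
        {
            throw new IllegalArgumentException("SHA-512 produces 64 bytes; cannot keep " + bytesToKeep);
        }
        final MessageDigest digest;
        try
        {
            digest = MessageDigest.getInstance("SHA-512");
        }
        catch (NoSuchAlgorithmException e)
        {
            throw new IllegalStateException("SHA-512 is not available", e);
        }
        digest.update(salt);
        digest.update(value.getBytes(StandardCharsets.UTF_8));
        return Arrays.copyOf(digest.digest(), bytesToKeep);
    }

    // Upper case hexadecimal with a 0x prefix, matching the style of the client's output.
    static String toHex(final byte[] bytes)
    {
        final StringBuilder hex = new StringBuilder("0x");
        for (final byte b : bytes)
        {
            hex.append(String.format("%02X", b & 0xFF));
        }
        return hex.toString();
    }

    public static void main(final String[] args)
    {
        final byte[] salt = newSalt();
        System.out.println("1234567881\t" + toHex(hash("1234567881", salt, 8)) + "\t" + toHex(salt));
    }
}
```

Keeping fewer bytes makes the pseudonym shorter but increases the chance that two different inputs collide, so the number of bytes kept needs to suit the size of your data set.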

By default, we apply all possible pseudonymisations. If you want only one or two you can specify them explicitly:

echo "1234567881" | hdn-pseudonymisation-client --pseudonymsiers SHA512First8Bytes --pseudonymsiers Random8Bytes

Which gives:

nhs_number  HashPseudonymiser(7, SHA-512)   HashPseudonymiser(7, SHA-512) Salt  JavaNoQuiteSoSecureRandomNumberGeneratorPseudonymiser()
1234567881  0x8AB6E347B06621    0xF988F41E2CBF3D3D8F2E1A2A4D7D88F478ABA7DE634F4277AF608914ED688C3EA49621A8542A5157DE2AAF959BA86D7DD64C20998B495ACD5A9C5EE8A55871B4  0x77F5813558FF91A3

If you specify more input lines you get more output lines:

printf "1234567881\n1231231238\n" | hdn-pseudonymisation-client --pseudonymsiers SHA512First8Bytes

Gives:

nhs_number  HashPseudonymiser(7, SHA-512)   HashPseudonymiser(7, SHA-512) Salt
1231231238  0x47F672D32B7075    0x696642F84A47FBFF86969769278E153F0B7FCDCAD4AE9E8A0FD7C447A12649C3F294F57258BB8F554ADAEDB6099FBADA53BBB4AC48E4E84F01B77044C1279064
1234567881  0xD21823FCE906D0    0xB48C0DFA94DB67CF5D6E9F11829EBB4EBC42CD0CDC8E16143BE6C2359FD3AF3F522840A04C7EE28FBDEA7F831298BD46D15BB78F734E39F8672A1

If there’s a duplicate in the input data then you’ll get fewer lines. So:

printf "1234567881\n1234567881\n" | hdn-pseudonymisation-client --pseudonymsiers SHA512First8Bytes

Gives just one data line:

nhs_number  HashPseudonymiser(7, SHA-512)   HashPseudonymiser(7, SHA-512) Salt
1234567881  0x8563F91EABB92F    0xCD84ED9E44243ED19A0E51ACB43B7ABC0B629917FF9CDACD88720837250DC0D093B2950B334137FD2D5C7E586B52EAAD1A33E71BAFA3D43C5AB73A28EAA8160E
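One way to picture why duplicates collapse is to think of the results as a table keyed by the input value: a value seen for the second time simply finds its existing entry. The sketch below shows this with a sequence pseudonym that starts at 0. It is a simplified stand-in with invented names, not the library’s own index table.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of a table that pseudonymises each distinct value exactly once.
class IndexTableSketch
{
    private final Map<String, Integer> pseudonyms = new LinkedHashMap<>();
    private int next = 0;

    // Returns the existing pseudonym for a value already seen, otherwise assigns the next number in sequence.
    int pseudonymise(final String value)
    {
        final Integer existing = pseudonyms.get(value);
        if (existing != null)
        {
            return existing;
        }
        final int assigned = next++;
        pseudonyms.put(value, assigned);
        return assigned;
    }

    // Number of distinct values seen so far, which is the number of data rows that would be output.
    int size()
    {
        return pseudonyms.size();
    }

    public static void main(final String[] args)
    {
        final IndexTableSketch table = new IndexTableSketch();
        for (final String value : new String[] {"1234567881", "1234567881", "1231231238"})
        {
            System.out.println(value + "\t" + String.format("0x%08X", table.pseudonymise(value)));
        }
        System.out.println(table.size() + " distinct values");
    }
}
```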

If you don’t want to use standard in, you can read from a file:

Set up a file, eg –

printf "1234567881\n1234567881\n" >/tmp/x

Then use it –

hdn-pseudonymisation-client --pseudonymsiers SHA512First8Bytes --file /tmp/x

Which gives –

nhs_number  HashPseudonymiser(7, SHA-512)   HashPseudonymiser(7, SHA-512) Salt
1234567881  0x501E7CEEA499E7    0x6DEC85D5CE880FFB82E142F0618AA85FFD4197F91BF7E7401F61F1B384815FC1D8AD94D9852A0B4F17A7FA1F137B27C74FEBAFCF9129AE4120ED470659106027
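The convention of using - to mean standard in is easy to reproduce if you’re writing your own tooling around the client. Here’s a rough sketch of it in Java; the names are made up, the UTF-8 encoding is an assumption, and note that readLine() also accepts CR and CRLF line endings, which is more lenient than the LF separated input the help text describes.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the "path, or - for standard in" convention.
class InputSourceSketch
{
    // Reads one value per line from the file at path, or from standardIn when path is "-".
    // The stream (including standardIn) is closed once all lines have been read.
    static List<String> readValues(final String path, final InputStream standardIn)
    {
        try (InputStream in = "-".equals(path) ? standardIn : Files.newInputStream(Paths.get(path));
             BufferedReader reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8)))
        {
            final List<String> values = new ArrayList<>();
            String line;
            while ((line = reader.readLine()) != null)
            {
                values.add(line);
            }
            return values;
        }
        catch (IOException e)
        {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(final String[] args)
    {
        if (args.length != 1)
        {
            System.err.println("usage: InputSourceSketch <path or ->");
            return;
        }
        for (final String value : readValues(args[0], System.in))
        {
            System.out.println(value);
        }
    }
}
```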

If you forget to use a file or pipe something to standard in, you’ll get a stack trace:

java.lang.IllegalStateException: It appears that stdin is not provided
    at uk.nhs.hdn.pseudonymisation.client.HdnPseudonymisationClientConsoleEntryPoint.execute(HdnPseudonymisationClientConsoleEntryPoint.java:77)
    at uk.nhs.hdn.common.commandLine.AbstractConsoleEntryPoint.execute(AbstractConsoleEntryPoint.java:354)
    at uk.nhs.hdn.common.commandLine.AbstractConsoleEntryPoint.execute(AbstractConsoleEntryPoint.java:68)
    at uk.nhs.hdn.common.commandLine.AbstractConsoleEntryPoint.execute(AbstractConsoleEntryPoint.java:62)
    at uk.nhs.hdn.pseudonymisation.client.HdnPseudonymisationClientConsoleEntryPoint.main(HdnPseudonymisationClientConsoleEntryPoint.java:52)

Of course, you can also choose to pseudonymise Post Codes:

echo "LS1 1XX" | hdn-pseudonymisation-client --pseudonymsiers SHA512First8Bytes --data-kind post_code

Giving:

post_code   HashPseudonymiser(7, SHA-512)   HashPseudonymiser(7, SHA-512) Salt
LS1 1XX 0xC47833E11F07B6    0x4728EAC0002E2CA9A647532290E04DC88902A342B32A081BAF6963602E9C589BC1B94D97D295C86376842C80390C227287846699EE41EDA86570079754B4B9AE

If you omit the value for a column, then we interpret it as null.

echo "" | hdn-pseudonymisation-client --pseudonymsiers SHA512First8Bytes --data-kind post_code

Produces:

post_code   HashPseudonymiser(7, SHA-512)   HashPseudonymiser(7, SHA-512) Salt
    0xB71AF686BBED6E

Note that the salt is also null – this is because we don’t run hashes on null values (a null value is not the same thing as an empty value, which can be hashed). As a result, null values are not stable between runs, which is what you need.

Using the java library to pseudonymise data programmatically

The way you do this varies depending on what you used above:

  • If you’ve downloaded or forked source from github, you can use IntelliJ. Open source/subprojects.ipr and start hacking.
  • If you’ve downloaded the jars (and source zips), create a project or add them to an existing project in your favourite IDE (if it isn’t IntelliJ, then switch now).

You can either add the hdn-pseudonymisation-client.jar as is (simple option), or, for more refinement and better integration with other HDN tools, its dependent jar files:

  • pseudonymisation
  • common-naming
  • common
  • common-postCodes
  • dts-domain
  • number
  • common-parsers-separatedValueParsers
  • common-serialisers-separatedValues
  • and the third-party library annotations.jar. This is a compile-time only dependency.

This list may change. To find the most up-to-date list, either extract META-INF/MANIFEST.MF from hdn-pseudonymisation-client.jar and read the Class-Path entry, or open the IntelliJ project (source/subprojects.ipr) and look at the dependencies of the module pseudonymisation-client (sensibly, module names match jar names and source zip names). Note that you’ll not need the common-commandLine module (jar) or jopt-simple library.
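If you’d rather not unpack the jar by hand, the Class-Path entry can be read with the JDK’s own java.util.jar classes. This is a small sketch, with a made-up class name, that prints one entry per line for whichever jar you point it at.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.jar.Attributes;
import java.util.jar.JarFile;
import java.util.jar.Manifest;

// Hypothetical helper for listing the Class-Path entries of a jar's manifest.
class ManifestClassPathSketch
{
    // Class-Path is a whitespace separated list in the manifest's main attributes.
    static List<String> fromManifest(final Manifest manifest)
    {
        final String classPath = manifest.getMainAttributes().getValue(Attributes.Name.CLASS_PATH);
        if (classPath == null || classPath.trim().isEmpty())
        {
            return Collections.emptyList();
        }
        return Arrays.asList(classPath.trim().split("\\s+"));
    }

    // Opens the jar, reads META-INF/MANIFEST.MF (if present) and returns its Class-Path entries.
    static List<String> fromJar(final String jarPath)
    {
        try (JarFile jar = new JarFile(jarPath))
        {
            final Manifest manifest = jar.getManifest();
            return manifest == null ? Collections.<String>emptyList() : fromManifest(manifest);
        }
        catch (IOException e)
        {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(final String[] args)
    {
        if (args.length != 1)
        {
            System.err.println("usage: ManifestClassPathSketch <path to jar>");
            return;
        }
        for (final String entry : fromJar(args[0]))
        {
            System.out.println(entry);
        }
    }
}
```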

Pseudonymising Data

The ‘guts’ of the java library’s API is the class MapIndexTable. It’s in the package uk.nhs.hdn.pseudonymisation.

Using it is quite straightforward:

final IndexTable<NhsNumber> nhsNumberMapIndexTable = new MapIndexTable<>();
final Pseudonymiser<NhsNumber> pseudonymiser = new HashPseudonymiser<>(SHA512);
pseudonymiser.pseudonymise(nhsNumber, nhsNumberMapIndexTable);

The first line creates a new MapIndexTable to store pseudonymised results. This object also manages duplicates.

The second line creates a Pseudonymiser. There are several kinds, available in uk.nhs.hdn.pseudonymisation.pseudonymisers.

Let’s fill in the value for nhsNumber:

final NhsNumber nhsNumber = valueOf("1234567881");

Lastly, this is just one way we might be able to get to our results:

nhsNumberMapIndexTable.iterate(new PsuedonymisedValuesUser<NhsNumber, ShouldNeverHappenException>()
{
    @SuppressWarnings("UseOfSystemOutOrSystemErr")
    @Override
    public boolean use(@NotNull final NhsNumber valueToPsuedonymise, @NotNull final PsuedonymisedValues<NhsNumber> pseudonymisedValues)
    {
        out.print(valueToPsuedonymise.normalised());
        out.print(pseudonymisedValues.get(pseudonymiser));
        return true;
    }
});

The boolean result tells the iterate() method to continue looping.
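If the callback style is unfamiliar, the following standalone sketch shows the general pattern of a visitor whose boolean result decides whether iteration carries on. It uses only the JDK and invented names, so treat it as a picture of the idea rather than the library’s interfaces; the table contents are made-up sample values.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of callback-driven iteration that can be stopped early.
class IterateSketch
{
    interface EntryUser<K, V>
    {
        // Return true to carry on with the next entry, false to stop.
        boolean use(K key, V value);
    }

    // Offers each entry to the user in turn; returns how many entries were visited.
    static <K, V> int iterate(final Map<K, V> table, final EntryUser<K, V> user)
    {
        int visited = 0;
        for (final Map.Entry<K, V> entry : table.entrySet())
        {
            visited++;
            if (!user.use(entry.getKey(), entry.getValue()))
            {
                break;
            }
        }
        return visited;
    }

    public static void main(final String[] args)
    {
        final Map<String, String> table = new LinkedHashMap<>();
        table.put("1234567881", "0x00000000");
        table.put("1231231238", "0x00000001");
        iterate(table, (key, value) ->
        {
            System.out.println(key + "\t" + value);
            return true;
        });
    }
}
```

Returning false is handy when you’re searching for a single entry and want to stop as soon as you’ve found it.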

Putting it all together, a sample class might look like:

/*
 * © Crown Copyright 2013
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package uk.nhs.hdn.pseudonymisation;

import org.jetbrains.annotations.NotNull;
import uk.nhs.hdn.common.exceptions.ShouldNeverHappenException;
import uk.nhs.hdn.number.NhsNumber;
import uk.nhs.hdn.pseudonymisation.pseudonymisers.HashPseudonymiser;
import uk.nhs.hdn.pseudonymisation.pseudonymisers.Pseudonymiser;

import static java.lang.System.out;
import static uk.nhs.hdn.common.MessageDigestHelper.SHA512;
import static uk.nhs.hdn.number.NhsNumber.valueOf;

public class Example1
{
    public void example()
    {
        final NhsNumber nhsNumber = valueOf("1234567881");

        final IndexTable<NhsNumber> nhsNumberMapIndexTable = new MapIndexTable<>();
        final Pseudonymiser<NhsNumber> pseudonymiser = new HashPseudonymiser<>(SHA512);
        pseudonymiser.pseudonymise(nhsNumber, nhsNumberMapIndexTable);

        nhsNumberMapIndexTable.iterate(new PsuedonymisedValuesUser<NhsNumber, ShouldNeverHappenException>()
        {
            @SuppressWarnings("UseOfSystemOutOrSystemErr")
            @Override
            public boolean use(@NotNull final NhsNumber valueToPsuedonymise, @NotNull final PsuedonymisedValues<NhsNumber> pseudonymisedValues)
            {
                out.print(valueToPsuedonymise.normalised());
                out.print(pseudonymisedValues.get(pseudonymiser));
                return true;
            }
        });
    }
}

This is available in the class uk.nhs.hdn.pseudonymisation.Example1.