Anonymizer definitions are used by HL7Viewer and HL7Script to quickly anonymize or de-identify one or more messages that contain PHI, making them suitable for replay in a test environment. When anonymizing a series of messages, the changed data is persisted to keep the messages consistent. For example, if PID.3 (the patient ID) is changed from "12345" to "TEST001" in the first message, "12345" is changed to "TEST001" in all PID.3 fields.
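The consistency rule described above can be pictured as a lookup keyed on the field and its original value. The following Python sketch is only an illustration of that idea, not the program's actual implementation; the class name ConsistentAnonymizer and the TEST### generator are hypothetical.

```python
import itertools


class ConsistentAnonymizer:
    """Remembers each replacement so repeated originals map to the same value."""

    def __init__(self, generate):
        self._generate = generate   # callable that produces a new replacement
        self._store = {}            # (field key, original value) -> replacement

    def anonymize(self, field_key, original):
        key = (field_key, original)
        if key not in self._store:
            # First time this original value is seen: generate and remember it.
            self._store[key] = self._generate()
        return self._store[key]


# Hypothetical generator producing TEST001, TEST002, ...
counter = itertools.count(1)
anonymizer = ConsistentAnonymizer(lambda: f"TEST{next(counter):03d}")

first = anonymizer.anonymize("PID.3", "12345")
second = anonymizer.anonymize("PID.3", "12345")   # same original, same result
other = anonymizer.anonymize("PID.3", "67890")    # new original, new value
print(first, second, other)  # TEST001 TEST001 TEST002
```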
A sample definition, Generic.anon.ini, is included with the release. It is a good starting point, but should not be considered an authoritative guide to anonymization. It covers the segments regularly encountered in the author's experience; test it with your own messages and adjust it until every instance of actual PHI is anonymized, including any in custom Z-segments.
A definition includes three (sometimes four) sections:
The [Global] section appears at the top and is used to set global options.
The [Values] section defines how to generate replacement values for various string, numeric, and date/time types.
The [Fields] section lists all the fields that require anonymization, and which value generator should be used for each. A field may also be replaced with another previously anonymized field.
The [Increments] section is maintained by the program if the SaveIncrements global option is enabled.
Below is a snippet of an example definition file:
; This is a comment
[Global]
Alphabet=BCDFGHJKLMNPQRSTVWXYZ
Persist=1
SaveIncrements=1
DataStore=D:\HL7\Example.anon.data
NamedFields=D:\HL7\HL7NamedFields.2.7.txt
[Values]
anyST=ST
anyNM=NM
anyDT=DT SameAge=1
MRN=NM Min=100000001 Increment=1 Prefix=M
Name=ST Min=3 Max=12
Street=ST Constant="123 ANON ST"
Zip=ST Constant=12345
Phone=ST Constant=(800)555-1212
Email=ST Constant=anon@example.com
SSN=NM Min=999101000 Increment=1 Mask=999-99-9999 Ignore=000-00-0000|999-99-9999
[Fields]
PID.3=MRN ;End-of-line comment
PID.5.1.1=Name
PID.5.1.2=Name
PID.5.2=Name
PID.5.3=anyST
PID.7=anyDT
PID.11.1=Street
PID.11.2=Blank
PID.11.3=Name
PID.11.5=Zip
PID.13.1=Phone
PID.13.4=Email
PID.14.1=Phone
PID.14.4=Email
PID.18.1=PID.3
PID.19=SSN
Blank lines are ignored. Comments start with a semicolon (;) and may be whole-line or end-of-line comments.
The default extension for a definition file is .anon.ini. Any extension works because definitions are plain text files, but .anon.ini is what HL7Viewer and HL7Script look for first.
The following options may be specified in the [Global] section:
A database may be used instead of the ini and datastore file for storing increment values and persisted data, respectively. The following global settings are used only when using a database, and override the file-based settings when provided.
The Anonymizer Database Schema section contains an example of how to create tables and procedures for working with anonymization data.
There are three types of value generators: strings (ST), numbers (NM), and dates (DT). Each value definition starts with a unique name, an equal sign, and one of the types.
; A bare minimum value definition
anyST=ST
Only the name and type are required, but there are numerous options to help generate an interesting value. Options are separated by spaces and are given in Option=Value format. If a value contains spaces or semicolons, enclose it in double quotes (").
; Quote option values that contain spaces or semicolons
Street=ST Constant="123 ANON ST"
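To make the syntax rules above concrete, here is a small Python sketch that splits a value definition into its name, type, and options while honoring double quotes and semicolon comments. It is an illustrative reading of the rules as stated here, not the parser used by HL7Viewer or HL7Script; the function names are hypothetical.

```python
def split_definition(line):
    """Split a definition line on whitespace, honoring "quotes" and ; comments."""
    tokens, current, in_quotes = [], [], False
    for ch in line:
        if ch == '"':
            in_quotes = not in_quotes       # quotes group text and are dropped
        elif ch == ";" and not in_quotes:
            break                           # the rest of the line is a comment
        elif ch.isspace() and not in_quotes:
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


def parse_value_definition(line):
    """Return (name, type, options) or None for blank and comment-only lines."""
    tokens = split_definition(line)
    if not tokens:
        return None
    name, _, value_type = tokens[0].partition("=")
    options = {}
    for token in tokens[1:]:
        option, _, value = token.partition("=")
        options[option] = value
    return name, value_type, options


print(parse_value_definition('Street=ST Constant="123 ANON ST"'))
# ('Street', 'ST', {'Constant': '123 ANON ST'})
```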
Two built-in value generators are always available: Blank and Null. Blank sets the value to an empty string, and Null sets it to the HL7 null value ("").
The available options vary based on the value type. If an option has a default value other than blank, it is shown in parentheses. Boolean values use 0 for False and 1 for True.
Date values generate a random date based on the options. If the input contains a time, the time remains unchanged.
The following options apply to all types, even when the value is a Constant. Ignore is always checked first to determine if the value should remain unchanged. After generating the value using the type-specific options, the general options are applied in the order they are provided in the definition. Each option may be specified only once.
Right-justifies/overlays a string of characters (usually digits) into a format
string. Especially handy for phone number/SSN formatting, but it could
conceivably be used on any type of input.
Ex: FormatDigits('6025551212', '(099)999-9999') -> '(602)555-1212'
FormatDigits('5551212', '(099)999-9999') -> '555-1212'
FormatDigits('6025551212', '999.999.9999') -> '602.555.1212'
FormatDigits('foo', 'bar') -> 'foobar'
All digit characters are always output even if the format string is shorter
or blank. Output stops when you run out of digit characters, even though there
may be more format string remaining.
Format string rules:
9 = Replace this character with a character from the digit string.
0 = Same as 9 but always includes the next format character to the left, even
if you have run out of digit characters.
* = Any other character is copied to the output as a literal.
To output a literal 0 or 9, precede it with the escape character. The default
escape character is a backslash, but it can be changed if you need backslashes
in your output.
Example: FormatDigits('123456', '999\0999') -> '1230456'
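The rules above can be sketched in Python as follows. This is one interpretation that reproduces the documented examples, not the program's own FormatDigits code. In particular, the handling of 0 (emit the literal to its left only when the digits run out exactly at that position) is an assumption drawn from the examples.

```python
def format_digits(digits, fmt, escape="\\"):
    """Right-justify `digits` into `fmt` following the format string rules."""
    # Tokenize the format left to right so escaped characters become literals.
    tokens = []
    i = 0
    while i < len(fmt):
        ch = fmt[i]
        if ch == escape and i + 1 < len(fmt):
            tokens.append(("lit", fmt[i + 1]))
            i += 2
            continue
        tokens.append((ch if ch in "09" else "lit", ch))
        i += 1

    out = []                  # built right to left, reversed at the end
    d = len(digits) - 1       # index of the next digit character to place
    t = len(tokens) - 1
    while t >= 0 and d >= 0:  # output stops once the digits run out
        kind, ch = tokens[t]
        if kind == "lit":
            out.append(ch)
        else:
            out.append(digits[d])
            d -= 1
            # Assumed meaning of 0: keep the literal to its left even though
            # no digit characters remain.
            if kind == "0" and d < 0 and t > 0 and tokens[t - 1][0] == "lit":
                out.append(tokens[t - 1][1])
        t -= 1
    while d >= 0:             # leftover digits are always output
        out.append(digits[d])
        d -= 1
    return "".join(reversed(out))


print(format_digits("6025551212", "(099)999-9999"))  # (602)555-1212
print(format_digits("5551212", "(099)999-9999"))     # 555-1212
print(format_digits("123456", "999\\0999"))          # 1230456
```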
The Fields section contains a list of all fields, components, and subcomponents that require anonymization. Each line consists of a field key, an equal sign, and the name of a value generator or a previously anonymized field key to copy.
If a Named Fields file has been specified, named fields may be used in field definitions. Numeric keys are always valid, even when a Named Fields file has been loaded.
The following example applies the value generator called "MRN" to PID.3:
PID.3=MRN
This example copies the value generated for PID.3 into PID.18. Note that PID.3 must be defined in the Fields list before PID.18 to do this.
PID.18=PID.3
All repetitions in all like segments will be anonymized unless the key provides a specific segment sequence and/or repetition index, e.g. NK1#1.5, PID.3~1. A specific repetition index is useful if, for example, a sender always uses the third repetition of PID.13 for the email address. In that case, list the regular PID.13 anonymization first, then the specific repetition.
PID.13.1=Phone
PID.13~3.1=Email ;Vendor always puts email here
If copying a previously anonymized field and the value should be copied from the same segment sequence and/or repetition that is currently being anonymized, wildcards can be used. The wildcard character is a question mark (?) and can follow either a segment sequence (#) or repetition (~) marker. The question marks will be replaced with the appropriate indexes for the current field. Without a wildcard, the first such segment (#1) and repetition (~1) are assumed.
PID.18=PID#?.3~?.1 ;Copies PID.3.1 from the same segment and repetition as this PID.18
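The wildcard substitution can be pictured as a simple text replacement using the indexes of the field currently being anonymized. The Python sketch below is illustrative only; resolve_wildcards and its parameters are hypothetical names, and it does not model how the program fills in the #1 and ~1 defaults.

```python
def resolve_wildcards(key, segment_index, repetition):
    """Replace #? and ~? with the indexes of the field being anonymized."""
    return key.replace("#?", f"#{segment_index}").replace("~?", f"~{repetition}")


# While anonymizing PID.18 in the 2nd PID segment, 3rd repetition:
print(resolve_wildcards("PID#?.3~?.1", 2, 3))  # PID#2.3~3.1
```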
A field definition may also include one or more of the following options:
If more flexibility is required in choosing a replacement for a field, conditional logic in IF-THEN-ELSE format can be used to select the correct value generator or field to copy:
PID.13.1=IF PID#?.13~?.2 == "NET" THEN Email ELSE Phone
; If the SSN starts with "X" don't change it:
PID.19.1=IF PID.19.1 ~= "X" THEN IGNORE ELSE SSN
"IF " must immediately follow the equal sign. The IF portion of the expression uses the same syntax as HL7Script IF statements. The THEN and ELSE parts are both required, and must provide either a value generator name, a field key to copy, or the word IGNORE to leave the field unchanged. Segment and repetition wildcards work as they do in non-conditional assignments.
The THEN and ELSE parts can also nest additional conditional logic expressions within parentheses:
PID.13.1=IF PID#?.13~?.2=="NET" THEN Email ELSE (IF PID#?.13~?.3=="CP" THEN CellPhone ELSE Phone)
Nesting is effectively unlimited, but the entire expression must be contained on a single line.
The Persist and Ignore options are still available when using conditional logic, and must be the last options on the line when present.
Here is an example of a possible database schema for persisting anonymization data, including Global Options tailored to work with it.
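If multiple threads or processes could be anonymizing data simultaneously, a thread-safe design using sequences, generators, or identity columns should be developed. Those constructs guarantee that no two connections can retrieve the same increment value. The example procedure below reads and then updates the counter in separate statements, so it does not provide that guarantee by itself.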
CREATE TABLE AnonStore (
    fieldkey NVARCHAR(50) NOT NULL,
    origdata NVARCHAR(250) NOT NULL,
    anondata NVARCHAR(250) NOT NULL,
    CONSTRAINT pk_AnonStore PRIMARY KEY (fieldkey, origdata)
)
GO
CREATE TABLE AnonInc (
    valuename NVARCHAR(50) NOT NULL PRIMARY KEY,
    lastincrement BIGINT NOT NULL
)
GO
CREATE PROCEDURE AnonIncrement(@valuename NVARCHAR(50), @inc BIGINT, @min BIGINT)
AS
BEGIN
    DECLARE @last BIGINT
    SELECT @last = lastincrement FROM AnonInc WHERE valuename = @valuename;
    IF @last IS NULL
        INSERT INTO AnonInc (valuename, lastincrement) VALUES (@valuename, @min);
    ELSE BEGIN
        SET @last = @last + @inc;
        UPDATE AnonInc SET lastincrement = @last WHERE valuename = @valuename;
    END
    SELECT lastincrement FROM AnonInc WHERE valuename = @valuename;
END
GO
Database=(your connection name here)
IncrementSQL=EXEC AnonIncrement :ValueName, :ValueInc, :ValueMin;
DataReadSQL=SELECT anondata FROM AnonStore WHERE fieldkey = :FieldKey AND origdata = :OrigData;
DataSaveSQL=INSERT INTO AnonStore (fieldkey, origdata, anondata) VALUES (:FieldKey, :OrigData, :AnonData);
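For readers who want to see the increment behavior outside of SQL, the logic of the AnonIncrement procedure (return the minimum on first use, then add the increment on each later call) can be sketched in Python. This is an in-memory illustration under the assumption of a single process; the IncrementStore class is hypothetical, and the lock stands in for the guarantee that a database sequence or identity column would provide.

```python
import threading


class IncrementStore:
    """In-memory stand-in for the AnonInc table and AnonIncrement procedure."""

    def __init__(self):
        self._last = {}                 # value name -> last increment issued
        self._lock = threading.Lock()   # serializes read-modify-write steps

    def next_value(self, value_name, inc, minimum):
        with self._lock:
            if value_name not in self._last:
                # First request for this value name starts at the minimum.
                self._last[value_name] = minimum
            else:
                self._last[value_name] += inc
            return self._last[value_name]


store = IncrementStore()
print(store.next_value("MRN", 1, 100000001))  # 100000001
print(store.next_value("MRN", 1, 100000001))  # 100000002
```

Because the read and the update happen while the lock is held, two threads cannot receive the same value.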