Automated validation and conversion of units used in clinical trials

Sam Tomioka

Feb 20, 2019

Introduction

The verification of scientific units and the conversion from reported units to standard units have always been challenging for Data Science, for several reasons:

1. A lookup table is needed that contains all possible input and output units for each measurement, the measurement names (e.g. Glucose, Weight, ...), conversion factors, molar weights, etc.
2. The measurement names in the lookup table and in the incoming data must match.
3. The incoming units must be in the lookup table.
4. Maintenance of the lookup table must be synchronized with standard terminology updates.
5. Careful medical review is required in addition to laborious Data Science review.
and more...

Despite these challenges, the lookup table approach is the norm at many companies for unit verification and conversion. Consideration was given to a more systematic approach that does not require lab test names[1], but some unit conversions rely on the molar weight and/or the ion valence of the specific lab test, so that approach does not solve the problem. Regulatory agencies require sponsors to use standardized units for reporting and analysis[2]. The PMDA requires SI units for all reporting and analysis[3,4]. These differences in requirements force us to maintain region-specific conversions for some measurements, which adds complexity.

The approach Jozef Aerts discussed uses REST API requests to the Unified Code for Units of Measure (UCUM) resources from the US National Library of Medicine[5]. The obvious benefit is that we could potentially eliminate maintenance of the lab conversion lookup table. Here is how UCUM describes itself:

The Unified Code for Units of Measure (UCUM) is a code system intended to include all units of measures being contemporarily used in international science, engineering, and business. The purpose is to facilitate unambiguous electronic communication of quantities together with their units. The focus is on electronic communication, as opposed to communication between humans. A typical application of The Unified Code for Units of Measure are electronic data interchange (EDI) protocols, but there is nothing that prevents it from being used in other types of machine communication.

UCUM is an ISO 11240 compliant standard. The FDA uses the UCUM syntax standard for dosage strength in both the content of product labeling and in drug establishment registration and drug listing.

Something to note:

The units used in an API call have to comply with the UCUM specification. In addition, URL encoding has to be applied to some special characters (for example, % becomes %25 and [ becomes %5B). In this quick proof of concept, regular expressions are used to convert the input units to UCUM-compliant units.
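The two steps in this note, rewriting a reported unit into UCUM syntax and then URL-encoding it, can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the rule list and the helper names `to_ucum` and `encode_for_url` are hypothetical, and real rules depend on the units found in the raw data.

```python
import re
from urllib.parse import quote

# Hypothetical rewrite rules for illustration only.
RULES = [
    (r"\A[xX]10E", "10*"),  # "x10E3/uL" -> "10*3/uL"
    (r"IU", "[IU]"),        # international unit is written [IU] in UCUM
    (r"Eq", "eq"),          # "mEq/L" -> "meq/L"
]

def to_ucum(unit: str) -> str:
    """Apply each regex rule in order to get a UCUM-style unit string."""
    for pattern, replacement in RULES:
        unit = re.sub(pattern, replacement, unit)
    return unit

def encode_for_url(unit: str) -> str:
    """Percent-encode special characters, leaving '/' and '*' untouched."""
    return quote(unit, safe="/*")

for raw in ["x10E3/uL", "IU/L", "mEq/L", "%"]:
    print(raw, "->", encode_for_url(to_ucum(raw)))
```

Keeping the UCUM rewrite separate from the URL encoding means the encoded form never has to be stored in the dataset; it is only needed when the request URL is built.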

Brief summary of findings

6458 laboratory records were used to test the UCUM REST API. These records come from an ongoing clinical trial with a standard set of clinical laboratory tests. Out of 6458 records, 321 were identified as incorrect conversions. Out of 322 findings, 169 were false positives, caused by the service not accounting for ion valence in mEq to molar unit conversions.

--	Records
Total Records	6458
Identified as incorrect conversion	321
True Positive	153
False Positive	169
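The valence issue behind the false positives can be shown with a small worked example. For an ion, the molar concentration equals the equivalent concentration divided by the ion's charge, so 1.6 mEq/L of magnesium (a divalent ion) is 0.8 mmol/L, not 1.6 mmol/L. The sketch below assumes a hypothetical valence table keyed by LBTESTCD; the charges themselves are standard chemistry.

```python
# Hypothetical lookup keyed by LBTESTCD; values are the ionic charges.
VALENCE = {
    "MG": 2,  # Mg2+
    "CA": 2,  # Ca2+
    "K": 1,   # K+
}

def meq_to_mmol(value_meq_per_l: float, testcd: str) -> float:
    """Convert mEq/L to mmol/L by dividing by the ion's valence."""
    return value_meq_per_l / VALENCE[testcd]

print(meq_to_mmol(1.6, "MG"))
```

This is consistent with the magnesium records in the output, where 1.6 meq/L is reported as roughly 0.80 mmol/L while the service returns 1.6.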

2142 records were identified as errors. Of these, 120 were errors because the result was categorical even though a unit was given. For the other 2022 records, the source and target units do not have the same property. Most of these are caused by the lack of mass-mol conversions; the rest appeared to be correct, but medical judgement would be necessary.

Type of Error	Records
ERROR: unexpected result: Error: Source and Target unit do not seem to belong to the same property	2022
ERROR: unexpected result: NEGATIVE is not a numeric value	119
ERROR: unexpected result: Negative is not a numeric value	1
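The "not a numeric value" errors could be avoided by separating categorical results before any request is sent. The following is a minimal sketch under that assumption; `is_numeric_result` and `split_records` are hypothetical helpers, and the sample records only mirror the shape of the lab data.

```python
import math

def is_numeric_result(value) -> bool:
    """Return True when the value parses as a finite number."""
    try:
        return math.isfinite(float(value))
    except (TypeError, ValueError):
        return False

def split_records(records):
    """Partition records into numeric and categorical by LBORRES."""
    numeric, categorical = [], []
    for rec in records:
        target = numeric if is_numeric_result(rec["LBORRES"]) else categorical
        target.append(rec)
    return numeric, categorical

records = [
    {"LBTESTCD": "GLUC", "LBORRES": "73", "LBORRESU": "mg/dL"},
    {"LBTESTCD": "OXYCDN", "LBORRES": "NEGATIVE", "LBORRESU": "ng/mL"},
]
numeric, categorical = split_records(records)
print(len(numeric), "numeric,", len(categorical), "categorical")
```

Categorical records that carry a unit can then be routed to a separate review instead of being counted as conversion errors.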

Overall, this approach worked for the majority of the 6458 records; however, a few improvements are required from NLM/NIH to fully utilize this REST API.

Need for mass-mol conversions *
Need to account for valence of ion with respect to mEq to molar unit conversion
Need for an option to specify molar weight or LOINC code for accurate unit conversion with respect to molar unit.

*According to Paul Lynch (NIH/NLM/LHC), the addition of mass-mol conversions to the ucum-lhc library is nearly complete on their end.
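To illustrate why a mass-mol conversion needs the molar weight of the analyte, here is a small worked sketch. It is not the service's implementation: the lookup table and function name are hypothetical, and the molar masses are approximate textbook values for glucose and creatinine.

```python
# Hypothetical lookup keyed by LBTESTCD; approximate molar masses in g/mol.
MOLAR_MASS_G_PER_MOL = {
    "GLUC": 180.156,  # glucose
    "CREAT": 113.12,  # creatinine
}

def mg_per_dl_to_mmol_per_l(value_mg_per_dl: float, testcd: str) -> float:
    """mg/dL -> mg/L is a factor of 10; mg/L divided by g/mol gives mmol/L."""
    return value_mg_per_dl * 10.0 / MOLAR_MASS_G_PER_MOL[testcd]

glucose_mmol = mg_per_dl_to_mmol_per_l(73, "GLUC")
creatinine_umol = 1000.0 * mg_per_dl_to_mmol_per_l(0.79, "CREAT")
print(round(glucose_mmol, 1), "mmol/L;", round(creatinine_umol), "umol/L")
```

The rounded results agree with the standardized values reported in the dataset for these two records (4.10 mmol/L for glucose 73 mg/dL and 70 umol/L for creatinine 0.79 mg/dL), which is why the same unit pair cannot be converted without knowing which test it belongs to.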

Thought

This approach has potential and can be used for any units (PK, Lab, ECG, Vital Signs, etc.). With the addition of mass-mol conversion in the near future, we could replace the current lab conversion lookup. However, the initial implementation of mass-mol conversion may require us to supply the molar weight in the API request, which would prevent us from completely getting rid of the existing lookup table. Additional verifications on lab data, such as checking for unique LBCAT-LBTEST-LBTESTCD pairs, will be required separately. Addition of LOINC support is thus strongly desired.
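One possible reading of the LBCAT-LBTEST-LBTESTCD check is that each test code should map to exactly one category and test name combination. The sketch below assumes that interpretation; the function name and the sample records are hypothetical, and a real study may legitimately use the same code in more than one category.

```python
from collections import defaultdict

def find_inconsistent_testcds(records):
    """Return test codes that appear with more than one (LBCAT, LBTEST) pair."""
    seen = defaultdict(set)
    for rec in records:
        seen[rec["LBTESTCD"]].add((rec["LBCAT"], rec["LBTEST"]))
    return {code: sorted(pairs) for code, pairs in seen.items() if len(pairs) > 1}

# Made-up records for illustration only.
records = [
    {"LBCAT": "CHEMISTRY", "LBTEST": "Glucose", "LBTESTCD": "GLUC"},
    {"LBCAT": "CHEMISTRY", "LBTEST": "Glucose", "LBTESTCD": "GLUC"},
    {"LBCAT": "CHEMISTRY", "LBTEST": "Magnesium", "LBTESTCD": "MG"},
    {"LBCAT": "CHEMISTRY", "LBTEST": "Magnesium, Serum", "LBTESTCD": "MG"},
]
print(find_inconsistent_testcds(records))
```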

[1] Wu and Wales (2017). Laboratory Data Standardization with SAS. PharmaSUG.

[2] FDA (2013). Position on Use of SI Units for Lab Tests.

[3] PMDA (2015). Notification on Practical Operations of Electronic Study Data Submissions.

[4] PMDA (2017). FAQs on Electronic Study Data Submission (Excerpt).

[5] Jozef Aerts (2019). SDTM --STRESN: why we need UCUM.

import boto3
import botocore
import re
import os
import pandas as pd
import numpy as np
import urllib.request  # urlopen lives in the request submodule, which a bare "import urllib" does not load
import xml.etree.ElementTree as ET
bucket='snvn-sagemaker-1' #data bucket
s3 = boto3.resource('s3')

KEY='mldata/Sam/data/093-701/lb_q2.sas7bdat' 

try:
    s3.Bucket(bucket).download_file(KEY, 'data/raw_lb.sas7bdat')
except botocore.exceptions.ClientError as e:
    if e.response['Error']['Code'] == "404":
        print("The object does not exist.")
    else:
        raise

rlb=pd.read_sas('data/raw_lb.sas7bdat',encoding='latin')
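df=rlb[['LBTESTCD','LBTEST','LBORRES','LBORRESU','LBSTRESU','LBSTRESN']].copy()  # copy to avoid SettingWithCopyWarning
# NOTE: this compares against the literal string 'LBSTRESU', so it keeps every row.
# To drop rows whose units already match, compare with df['LBSTRESU'] instead.
df=df[(df['LBORRESU']!='LBSTRESU')].copy()
df.dropna(axis=0, subset=['LBORRESU'], inplace=True)
# Flag results reported with a '<' sign (e.g. '<0.1'), then strip the sign
df['ge']=df['LBORRES'].str.findall(r'<')
df['LBORRES']=df['LBORRES'].str.replace('<','',regex=False)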
df.shape

(6458, 7)

Get unique units and convert them to UCUM

#Regular expressions. --- update this based on raw data
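# Each pair is (regex, substitution); raw strings keep \A from being read as a string escape
patterns = [(r"%", "%25"),
           (r"\A[xX]10E", "10*"),
           (r"IU", "%5BIU%5D"),
           (r"\Anan", ""),
           (r"Eq", "eq"),
           (r"V/V", "L/L")
           ]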

def convert_unit(df, patterns):
    original=pd.unique(df[['LBORRESU','LBSTRESU']].values.ravel('K'))
    original=original.tolist()
    #print("units in dataset: ", original)
    ucum=original
    def cleanlist(lst, regex, substitution):
        cleaned_list = [re.sub(regex, substitution, str(line)) for line in lst]
        return cleaned_list
    for x, sub in patterns:
        ucum=cleanlist(ucum, x, sub)
    b = dict(zip(original, ucum))
    df['LBORRESU']=df['LBORRESU'].map(b)
    df['LBSTRESU']=df['LBSTRESU'].map(b)
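    # Annotate bare percent units with the test code, e.g. %{HCT}.
    # NOTE: the patterns above have already turned '%' into '%25', so these masks match nothing as written.
    mask = df['LBORRESU'] =='%'
    df.loc[mask, 'LBORRESU'] = '%{'+df['LBTESTCD']+'}'
    mask = df['LBSTRESU'] =='%'
    df.loc[mask, 'LBSTRESU'] = '%{'+df['LBTESTCD']+'}'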
    checklist=[]
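    for i in range(0,len(df)):
        # Select fields by column name; positional lookups such as data[2] on a labelled Series are deprecated in pandas
        data=df.iloc[i]
        tmp="https://ucum.nlm.nih.gov/ucum-service/v1/ucumtransform/"+str(data['LBORRES'])+"/from/"+str(data['LBORRESU'])+"/to/"+str(data['LBSTRESU'])
        checklist.append(tmp)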
    response=[]
    for i in range(0,len(checklist)):
        #print(checklist[i], i)
        with urllib.request.urlopen(checklist[i]) as res:
            context=ET.fromstring(res.read())
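            for child in context:
                tmp1=[]
                if child.text is not None:
                    # A text-only child carries an error message; pad so index 3 is NaN
                    tmp1=[child.text,float('NaN'),float('NaN'),float('NaN')]
                else:
                    # Otherwise collect the sub-elements; the code below reads the converted value from index 3
                    for element in child:
                        tmp1.append(element.text)
            response.append(tmp1)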
    fromucmc=[float(response[x][3]) for x in range(len(response))]
    rawdata=df['LBSTRESN'].tolist()
    df['fromucmc']=fromucmc
    check=[i!=j for i, j in zip(fromucmc, rawdata)]
    return df[check]

issue=convert_unit(df, patterns)

Output Issues

issue[(issue['fromucmc'].notnull())]

issue[(issue['fromucmc'].isnull())]

	LBTESTCD	LBTEST	LBORRES	LBORRESU	LBSTRESU	LBSTRESN	ge	fromucmc
2	HCT	Hematocrit	43.1	%25	L/L	0.43	[]	0.431
24	MG	Magnesium	1.6	meq/L	mmol/L	0.81	[]	1.600
61	HCT	Hematocrit	43.2	%25	L/L	0.43	[]	0.432
83	MG	Magnesium	1.8	meq/L	mmol/L	0.88	[]	1.800
110	HCT	Hematocrit	38.9	%25	L/L	0.39	[]	0.389
132	MG	Magnesium	1.5	meq/L	mmol/L	0.76	[]	1.500
170	HCT	Hematocrit	37.6	%25	L/L	0.38	[]	0.376
192	MG	Magnesium	1.5	meq/L	mmol/L	0.76	[]	1.500
216	HCT	Hematocrit	42.9	%25	L/L	0.43	[]	0.429
238	MG	Magnesium	1.5	meq/L	mmol/L	0.77	[]	1.500
273	HCT	Hematocrit	41.4	%25	L/L	0.41	[]	0.414
295	MG	Magnesium	1.7	meq/L	mmol/L	0.86	[]	1.700
341	MG	Magnesium	1.7	meq/L	mmol/L	0.83	[]	1.700
380	HCT	Hematocrit	44.5	%25	L/L	0.45	[]	0.445
402	MG	Magnesium	1.6	meq/L	mmol/L	0.81	[]	1.600
429	HCT	Hematocrit	41.2	%25	L/L	0.41	[]	0.412
473	MG	Magnesium	1.7	meq/L	mmol/L	0.84	[]	1.700
489	HCT	Hematocrit	41.5	%25	L/L	0.42	[]	0.415
511	MG	Magnesium	1.6	meq/L	mmol/L	0.80	[]	1.600
534	HCT	Hematocrit	45.3	%25	L/L	0.45	[]	0.453
556	MG	Magnesium	1.8	meq/L	mmol/L	0.92	[]	1.800
590	HCT	Hematocrit	44.8	%25	L/L	0.45	[]	0.448
612	MG	Magnesium	1.6	meq/L	mmol/L	0.82	[]	1.600
646	HCT	Hematocrit	45.6	%25	L/L	0.46	[]	0.456
668	MG	Magnesium	1.7	meq/L	mmol/L	0.86	[]	1.700
713	MG	Magnesium	1.7	meq/L	mmol/L	0.86	[]	1.700
749	HCT	Hematocrit	46.6	%25	L/L	0.47	[]	0.466
771	MG	Magnesium	1.7	meq/L	mmol/L	0.83	[]	1.700
787	HCT	Hematocrit	47.3	%25	L/L	0.47	[]	0.473
803	HCT	Hematocrit	40.9	%25	L/L	0.41	[]	0.409
...	...	...	...	...	...	...	...	...
8563	HCT	Hematocrit	43.9	%25	L/L	0.44	[]	0.439
8585	MG	Magnesium	1.7	meq/L	mmol/L	0.83	[]	1.700
8624	HCT	Hematocrit	49.1	%25	L/L	0.49	[]	0.491
8646	MG	Magnesium	1.7	meq/L	mmol/L	0.87	[]	1.700
8683	HCT	Hematocrit	34.1	%25	L/L	0.34	[]	0.341
8705	MG	Magnesium	1.7	meq/L	mmol/L	0.86	[]	1.700
8744	HCT	Hematocrit	43.6	%25	L/L	0.44	[]	0.436
8766	MG	Magnesium	1.6	meq/L	mmol/L	0.81	[]	1.600
8804	HCT	Hematocrit	38.1	%25	L/L	0.38	[]	0.381
8826	MG	Magnesium	1.7	meq/L	mmol/L	0.86	[]	1.700
8865	HCT	Hematocrit	45.9	%25	L/L	0.46	[]	0.459
8887	MG	Magnesium	2.1	meq/L	mmol/L	1.03	[]	2.100
8926	HCT	Hematocrit	39.8	%25	L/L	0.40	[]	0.398
8948	MG	Magnesium	1.7	meq/L	mmol/L	0.83	[]	1.700
8987	HCT	Hematocrit	40.6	%25	L/L	0.41	[]	0.406
9009	MG	Magnesium	1.6	meq/L	mmol/L	0.80	[]	1.600
9050	HCT	Hematocrit	46.8	%25	L/L	0.47	[]	0.468
9072	MG	Magnesium	1.9	meq/L	mmol/L	0.96	[]	1.900
9109	HCT	Hematocrit	40.1	%25	L/L	0.40	[]	0.401
9131	MG	Magnesium	1.3	meq/L	mmol/L	0.67	[]	1.300
9172	HCT	Hematocrit	43.9	%25	L/L	0.44	[]	0.439
9194	MG	Magnesium	1.8	meq/L	mmol/L	0.92	[]	1.800
9241	HCT	Hematocrit	37.2	%25	L/L	0.37	[]	0.372
9263	MG	Magnesium	1.6	meq/L	mmol/L	0.80	[]	1.600
9300	HCT	Hematocrit	45.7	%25	L/L	0.46	[]	0.457
9322	MG	Magnesium	1.8	meq/L	mmol/L	0.92	[]	1.800
9372	HCT	Hematocrit	39.9	%25	L/L	0.40	[]	0.399
9394	MG	Magnesium	1.7	meq/L	mmol/L	0.84	[]	1.700
9434	HCT	Hematocrit	38.1	%25	L/L	0.38	[]	0.381
9456	MG	Magnesium	1.6	meq/L	mmol/L	0.80	[]	1.600

	LBTESTCD	LBTEST	LBORRES	LBORRESU	LBSTRESU	LBSTRESN	ge	fromucmc
19	UREAN	Urea Nitrogen	12	mg/dL	mmol/L	4.21	[]	NaN
20	CREAT	Creatinine	0.79	mg/dL	umol/L	70.00	[]	NaN
21	GLUC	Glucose	73	mg/dL	mmol/L	4.10	[]	NaN
22	CA	Calcium	9.3	mg/dL	mmol/L	2.34	[]	NaN
23	PHOS	Phosphate	3.7	mg/dL	mmol/L	1.20	[]	NaN
27	BILI	Bilirubin	0.2	mg/dL	umol/L	3.00	[]	NaN
28	BILDIR	Direct Bilirubin	0.1	mg/dL	umol/L	NaN	[<]	NaN
29	BILIND	Indirect Bilirubin	0.1	mg/dL	umol/L	2.00	[]	NaN
32	ALP	Alkaline Phosphatase	113	U/L	%5BIU%5D/L	113.00	[]	NaN
33	URATE	Urate	3.8	mg/dL	umol/L	226.00	[]	NaN
37	T4FR	Thyroxine; Free	1.19	ng/dL	pmol/L	15.40	[]	NaN
38	T3FR	Triiodothyronine; Free	3.14	pg/mL	pmol/L	4.80	[]	NaN
57	OXYCDN	Oxycodone	NEGATIVE	ng/mL	ng/mL	NaN	[]	NaN
78	UREAN	Urea Nitrogen	10	mg/dL	mmol/L	3.61	[]	NaN
79	CREAT	Creatinine	0.63	mg/dL	umol/L	56.00	[]	NaN
80	GLUC	Glucose	65	mg/dL	mmol/L	3.60	[]	NaN
81	CA	Calcium	9.6	mg/dL	mmol/L	2.39	[]	NaN
82	PHOS	Phosphate	2.6	mg/dL	mmol/L	0.84	[]	NaN
86	BILI	Bilirubin	0.2	mg/dL	umol/L	3.00	[]	NaN
87	BILDIR	Direct Bilirubin	0.1	mg/dL	umol/L	NaN	[<]	NaN
88	BILIND	Indirect Bilirubin	0.1	mg/dL	umol/L	2.00	[]	NaN
91	ALP	Alkaline Phosphatase	123	U/L	%5BIU%5D/L	123.00	[]	NaN
92	URATE	Urate	3.0	mg/dL	umol/L	178.00	[]	NaN
95	T4FR	Thyroxine; Free	0.82	ng/dL	pmol/L	10.50	[]	NaN
96	T3FR	Triiodothyronine; Free	2.59	pg/mL	pmol/L	4.00	[]	NaN
127	UREAN	Urea Nitrogen	12	mg/dL	mmol/L	4.43	[]	NaN
128	CREAT	Creatinine	0.77	mg/dL	umol/L	68.00	[]	NaN
129	GLUC	Glucose	115	mg/dL	mmol/L	6.40	[]	NaN
130	CA	Calcium	9.0	mg/dL	mmol/L	2.26	[]	NaN
131	PHOS	Phosphate	3.6	mg/dL	mmol/L	1.16	[]	NaN
...	...	...	...	...	...	...	...	...
9331	URATE	Urate	4.2	mg/dL	umol/L	250.00	[]	NaN
9334	T4FR	Thyroxine; Free	1.37	ng/dL	pmol/L	17.60	[]	NaN
9335	T3FR	Triiodothyronine; Free	4.73	pg/mL	pmol/L	7.30	[]	NaN
9355	OXYCDN	Oxycodone	NEGATIVE	ng/mL	ng/mL	NaN	[]	NaN
9389	UREAN	Urea Nitrogen	7	mg/dL	mmol/L	2.61	[]	NaN
9390	CREAT	Creatinine	0.81	mg/dL	umol/L	72.00	[]	NaN
9391	GLUC	Glucose	70	mg/dL	mmol/L	3.90	[]	NaN
9392	CA	Calcium	9.1	mg/dL	mmol/L	2.28	[]	NaN
9393	PHOS	Phosphate	2.5	mg/dL	mmol/L	0.81	[]	NaN
9397	BILI	Bilirubin	0.4	mg/dL	umol/L	7.00	[]	NaN
9398	BILDIR	Direct Bilirubin	0.1	mg/dL	umol/L	2.00	[]	NaN
9399	BILIND	Indirect Bilirubin	0.3	mg/dL	umol/L	5.00	[]	NaN
9402	ALP	Alkaline Phosphatase	85	U/L	%5BIU%5D/L	85.00	[]	NaN
9403	URATE	Urate	3.3	mg/dL	umol/L	196.00	[]	NaN
9407	T4FR	Thyroxine; Free	1.00	ng/dL	pmol/L	12.90	[]	NaN
9408	T3FR	Triiodothyronine; Free	3.29	pg/mL	pmol/L	5.00	[]	NaN
9429	OXYCDN	Oxycodone	NEGATIVE	ng/mL	ng/mL	NaN	[]	NaN
9451	UREAN	Urea Nitrogen	12	mg/dL	mmol/L	4.11	[]	NaN
9452	CREAT	Creatinine	0.74	mg/dL	umol/L	65.00	[]	NaN
9453	GLUC	Glucose	87	mg/dL	mmol/L	4.80	[]	NaN
9454	CA	Calcium	9.8	mg/dL	mmol/L	2.44	[]	NaN
9455	PHOS	Phosphate	3.9	mg/dL	mmol/L	1.26	[]	NaN
9459	BILI	Bilirubin	0.2	mg/dL	umol/L	3.00	[]	NaN
9460	BILDIR	Direct Bilirubin	0.1	mg/dL	umol/L	NaN	[<]	NaN
9461	BILIND	Indirect Bilirubin	0.1	mg/dL	umol/L	2.00	[]	NaN
9464	ALP	Alkaline Phosphatase	70	U/L	%5BIU%5D/L	70.00	[]	NaN
9465	URATE	Urate	4.6	mg/dL	umol/L	274.00	[]	NaN
9469	T4FR	Thyroxine; Free	1.13	ng/dL	pmol/L	14.60	[]	NaN
9470	T3FR	Triiodothyronine; Free	3.12	pg/mL	pmol/L	4.80	[]	NaN
9489	OXYCDN	Oxycodone	NEGATIVE	ng/mL	ng/mL	NaN	[]	NaN