Automated validation and conversion of units used in clinical trials

Sam Tomioka

Feb 20, 2019

Introduction

The verification of scientific units and the conversion from reported units to standard units have always been challenging for Data Science for several reasons:

1. We need a lookup table that contains all possible input and output units for measurements, names of the measurements (e.g. Glucose, Weight, ...), conversion factors, molar weights, etc.
2. The names of the measurements in the lookup table and in the incoming data must match
3. The incoming units must be in the lookup table
4. Maintenance of the lookup table must be synchronized with standard terminology updates
5. Careful medical review is required in addition to laborious Data Science review
and more...

Despite the challenges, the lookup table approach is the norm at many companies for verifying and converting units. Consideration was given to a more systematic approach that does not require lab test names[1], but some units rely on the molar weight and/or ion valence of the specific lab test, so this approach does not solve the problem. Regulatory agencies require sponsors to use standardized units for reporting and analysis[2]. The PMDA requires SI units for all reporting and analysis[3,4]. These differences in requirements force us to maintain region-specific conversions for some measurements, which adds complexity.

The approach Jozef Aerts discussed uses REST API requests to the Unified Code for Units of Measure (UCUM) resources from the US National Library of Medicine[5]. The obvious benefit is that we can potentially eliminate the maintenance of the lab conversion lookup table. Here is how UCUM describes itself:

The Unified Code for Units of Measure (UCUM) is a code system intended to include all units of measures being contemporarily used in international science, engineering, and business. The purpose is to facilitate unambiguous electronic communication of quantities together with their units. The focus is on electronic communication, as opposed to communication between humans. A typical application of The Unified Code for Units of Measure are electronic data interchange (EDI) protocols, but there is nothing that prevents it from being used in other types of machine communication.

UCUM is an ISO 11240 compliant standard. The FDA uses the UCUM syntax standard for dosage strength in both the content of product labeling and in drug establishment registration and drug listing.

Something to note:

The units used in an API call have to be compliant with the UCUM specification. In addition, URL encoding has to be applied to some special characters. URL encoding can be found here. In this quick proof of concept, regular expressions are used to convert the input units to UCUM-compliant units.
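As a minimal sketch of the URL encoding step, assuming Python's standard `urllib.parse.quote` is acceptable for this purpose: the base URL is the one used in the request code later in this notebook, while `build_transform_url` is a hypothetical helper name and not part of any UCUM library.

```python
from urllib.parse import quote

# Base URL taken from the request code later in this notebook.
BASE = "https://ucum.nlm.nih.gov/ucum-service/v1/ucumtransform/"

def build_transform_url(value, source_unit, target_unit):
    """Hypothetical helper: percent-encode both units and build the request URL.

    quote() leaves "/" unescaped by default, so "mg/dL" stays readable while
    characters such as "%", "[" and "]" are encoded.
    """
    return (BASE + str(value)
            + "/from/" + quote(source_unit)
            + "/to/" + quote(target_unit))

print(build_transform_url(113, "U/L", "[IU]/L"))
print(build_transform_url(43.1, "%", "L/L"))
```

Encoding at request time, rather than inside the unit strings themselves, would keep the unit columns human-readable; the notebook below takes the simpler route of encoding through the regular expression list.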

Brief summary of findings

6458 laboratory records were used to test the UCUM REST API. These records are from one ongoing clinical trial with a standard set of clinical laboratory tests. Out of 6458 records, 322 were identified as incorrect conversions. Out of these 322 findings, 169 were false positives caused by the API not accounting for the valence of the ion in mEq to molar unit conversions.

Category Records
Total Records 6458
Identified as incorrect conversion 322
True Positive 153
False Positive 169
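
The false positives arise because mEq/L and mmol/L are treated as interchangeable. A minimal sketch of the valence correction, assuming the ion's charge is supplied by the caller (`meq_to_mmol` is a hypothetical helper, not part of the UCUM service):

```python
def meq_to_mmol(meq_per_l, valence):
    """Convert mEq/L to mmol/L by dividing by the absolute ionic charge."""
    if valence == 0:
        raise ValueError("valence must be non-zero")
    return meq_per_l / abs(valence)

# Magnesium is divalent (Mg2+), so 1.6 mEq/L corresponds to 0.8 mmol/L,
# not the 1.6 mmol/L obtained when valence is ignored.
print(meq_to_mmol(1.6, 2))
```

The result is in line with the roughly 0.8 mmol/L values reported in LBSTRESN for the Magnesium records in this dataset.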

2142 records were returned as errors. Of these, 120 records failed because the result was categorical even though a unit was given. For the other 2022 records, the source and target units do not belong to the same property. Most of these are caused by the lack of mass-mol conversions; the rest appeared to be correct, but medical judgement would be necessary.

Type of Error Records
ERROR: unexpected result: Error: Source and Target unit do not seem to belong to the same property 2022
ERROR: unexpected result: NEGATIVE is not a numeric value 119
ERROR: unexpected result: Negative is not a numeric value 1
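
The categorical results could be screened out before any request is sent. The following is a sketch only, assuming results arrive as strings; `is_numeric` is a hypothetical helper name.

```python
import math

def is_numeric(result):
    """Return True if a reported result can be parsed as a finite number."""
    try:
        value = float(str(result).strip())
    except ValueError:
        return False
    return math.isfinite(value)

results = ["43.1", "NEGATIVE", "Negative", "0.1"]
# Keep only the results that would be rejected as non-numeric.
skipped = [r for r in results if not is_numeric(r)]
print(skipped)
```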

Overall, this approach worked for the majority of the 6458 records; however, a few improvements are required from NLM/NIH before we can fully utilize this REST API.

  1. Need for mass-mol conversions *
  2. Need to account for valence of ion with respect to mEq to molar unit conversion
  3. Need for an option to specify molar weight or LOINC code for accurate unit conversion with respect to molar unit.

*According to Paul Lynch (NIH/NLM/LHC), the addition of mass-mol conversions to the ucum-lhc library is nearly complete on their end.
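
To illustrate why a molar weight is needed for item 1, here is a minimal sketch of a mass-to-molar conversion. The function name and the glucose molar mass constant are assumptions made for this example, not values taken from the UCUM service.

```python
def mg_dl_to_mmol_l(mg_per_dl, molar_mass_g_per_mol):
    """Convert a mass concentration in mg/dL to a molar concentration in mmol/L."""
    # mg/dL -> mg/L is a factor of 10; mg/L divided by g/mol gives mmol/L.
    return mg_per_dl * 10.0 / molar_mass_g_per_mol

GLUCOSE_MOLAR_MASS = 180.16  # g/mol, assumed value for glucose (C6H12O6)

print(round(mg_dl_to_mmol_l(73, GLUCOSE_MOLAR_MASS), 2))
```

For a 73 mg/dL glucose result this gives about 4.05 mmol/L, which rounds to the 4.1 mmol/L reported in LBSTRESN for such a record. The same unit pair (mg/dL to mmol/L) needs a different factor for every analyte, which is why the unit strings alone are not enough.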

Thought

This approach has potential and can be used for any units (PK, Lab, ECG, Vital Signs, etc.). With the addition of mass-mol conversion in the near future, we could replace the current lab conversion lookup. However, the initial implementation of mass-mol conversion may require us to supply the molar weight in the API request, which would prevent us from completely getting rid of the existing lookup table. Additional verifications on lab data, such as the LBCAT-LBTEST-LBTESTCD unique pair check, will be required separately. Support for LOINC is thus strongly desired.
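
One possible reading of the LBCAT-LBTEST-LBTESTCD check is that each test code should map to exactly one category and test name. A sketch under that assumption follows; the helper name and the sample records, including the LBCAT values, are made up for illustration.

```python
from collections import defaultdict

def find_inconsistent_codes(records):
    """Return test codes that map to more than one (LBCAT, LBTEST) pair."""
    seen = defaultdict(set)
    for rec in records:
        seen[rec["LBTESTCD"]].add((rec["LBCAT"], rec["LBTEST"]))
    return {code: sorted(pairs) for code, pairs in seen.items() if len(pairs) > 1}

# Illustrative records only; not taken from the trial data.
records = [
    {"LBCAT": "CHEMISTRY", "LBTEST": "Glucose", "LBTESTCD": "GLUC"},
    {"LBCAT": "CHEMISTRY", "LBTEST": "Glucose", "LBTESTCD": "GLUC"},
    {"LBCAT": "CHEMISTRY", "LBTEST": "Magnesium", "LBTESTCD": "MG"},
    {"LBCAT": "HEMATOLOGY", "LBTEST": "Magnesium", "LBTESTCD": "MG"},
]
print(find_inconsistent_codes(records))
```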

In [1]:
import boto3
import botocore
import re
import os
import pandas as pd
import numpy as np
import urllib.request  # urlopen lives in the request submodule
import xml.etree.ElementTree as ET
bucket='snvn-sagemaker-1' #data bucket
s3 = boto3.resource('s3')
In [2]:
KEY='mldata/Sam/data/093-701/lb_q2.sas7bdat' 

try:
    s3.Bucket(bucket).download_file(KEY, 'data/raw_lb.sas7bdat')
except botocore.exceptions.ClientError as e:
    if e.response['Error']['Code'] == "404":
        print("The object does not exist.")
    else:
        raise 
In [9]:
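rlb=pd.read_sas('data/raw_lb.sas7bdat',encoding='latin')
df=rlb[['LBTESTCD','LBTEST','LBORRES','LBORRESU','LBSTRESU','LBSTRESN']]
# NOTE: this compares LBORRESU against the literal string 'LBSTRESU', so no rows are dropped;
# records whose original and standard units already match stay in the data.
# .copy() avoids SettingWithCopyWarning on the in-place edits below.
df=df[(df['LBORRESU']!='LBSTRESU')].copy()
df.dropna(axis=0, subset=['LBORRESU'], inplace=True)
# Flag results reported with '<' (below the limit), then strip the sign so the value is numeric.
df['ge']=df['LBORRES'].str.findall(r'<')
df['LBORRES']=df['LBORRES'].str.replace('<','',regex=False)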
df.shape
Out[9]:
(6458, 7)

Get unique units and convert them to UCUM

In [4]:
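#Regular expressions. --- update this based on raw data
# Each pair is (regex, replacement); pairs are applied in order.
patterns = [("%","%25"),             # URL-encode the percent sign
           (r"\A[xX]10E", "10*"),    # x10E9 -> 10*9 (UCUM exponent syntax)
           ("IU", "%5BIU%5D"),       # IU -> [IU], URL-encoded
           (r"\Anan", ""),           # missing units show up as the string 'nan'
           ("Eq","eq"),              # mEq -> meq
           ("V/V","L/L")
           ]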
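
To make the effect of these substitutions concrete, the short sketch below applies the same pattern list to a few sample unit strings. The helper name `to_ucum` and the sample units are chosen for illustration only.

```python
import re

# Same (regex, replacement) pairs as in the cell above, applied in order.
patterns = [("%", "%25"),
            (r"\A[xX]10E", "10*"),
            ("IU", "%5BIU%5D"),
            (r"\Anan", ""),
            ("Eq", "eq"),
            ("V/V", "L/L")]

def to_ucum(unit):
    """Apply every substitution in sequence to a single unit string."""
    for regex, sub in patterns:
        unit = re.sub(regex, sub, str(unit))
    return unit

for raw in ["%", "x10E9/L", "IU/L", "mEq/L", "V/V", "mg/dL"]:
    print(raw, "->", to_ucum(raw))
```

Order matters here: the percent sign is encoded first, so the `%5B` and `%5D` inserted by the `IU` rule are not encoded a second time.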
In [5]:
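def convert_unit(df, patterns):
    # Collect every distinct unit found in the original and standard unit columns.
    original=pd.unique(df[['LBORRESU','LBSTRESU']].values.ravel('K'))
    original=original.tolist()
    #print("units in dataset: ", original)
    ucum=original
    def cleanlist(lst, regex, substitution):
        cleaned_list = [re.sub(regex, substitution, str(line)) for line in lst]
        return cleaned_list
    # Apply the substitutions in order to get UCUM-compliant, URL-encoded units.
    for x, sub in patterns:
        ucum=cleanlist(ucum, x, sub)
    # Map each raw unit to its converted form and update both unit columns.
    b = dict(zip(original, ucum))
    df['LBORRESU']=df['LBORRESU'].map(b)
    df['LBSTRESU']=df['LBSTRESU'].map(b)
    # Annotate a bare percent unit with the test code, e.g. %{HCT}.
    # (No effect while patterns rewrites '%' to '%25'.)
    mask = df['LBORRESU'] =='%'
    df.loc[mask, 'LBORRESU'] = '%{'+df['LBTESTCD']+'}'
    mask = df['LBSTRESU'] =='%'
    df.loc[mask, 'LBSTRESU'] = '%{'+df['LBTESTCD']+'}'
    # Build one request URL per record: value / source unit / target unit.
    checklist=[]
    for i in range(0,len(df)):
        data=df.iloc[i,0:5]
        # Positional access: 2=LBORRES, 3=LBORRESU, 4=LBSTRESU
        tmp="https://ucum.nlm.nih.gov/ucum-service/v1/ucumtransform/"+str(data.iloc[2])+"/from/"+str(data.iloc[3])+"/to/"+str(data.iloc[4])
        checklist.append(tmp)
    # Call the service for every URL and parse the XML response.
    response=[]
    for i in range(0,len(checklist)):
        #print(checklist[i], i)
        with urllib.request.urlopen(checklist[i]) as res:
            context=ET.fromstring(res.read())
            for child in context:
                tmp1=[]
                if child.text is not None:
                    # Text directly on the child is an error message; no numeric result.
                    #print(child.text)
                    tmp1=[child.text,float('NaN'),float('NaN'),float('NaN')]
                else:
                    # Otherwise collect the nested elements; index 3 is read as the converted value below.
                    for element in child:
                        tmp1.append(element.text)
            response.append(tmp1)
    fromucmc=[float(response[x][3]) for x in range(len(response))]
    rawdata=df['LBSTRESN'].tolist()
    df['fromucmc']=fromucmc
    # Exact comparison; NaN never equals anything, so error records are returned as well.
    check=[i!=j for i, j in zip(fromucmc, rawdata)]
    return df[check]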
In [6]:
issue=convert_unit(df, patterns)

Output Issues

In [7]:
issue[(issue['fromucmc'].notnull())]
Out[7]:
LBTESTCD LBTEST LBORRES LBORRESU LBSTRESU LBSTRESN ge fromucmc
2 HCT Hematocrit 43.1 %25 L/L 0.43 [] 0.431
24 MG Magnesium 1.6 meq/L mmol/L 0.81 [] 1.600
61 HCT Hematocrit 43.2 %25 L/L 0.43 [] 0.432
83 MG Magnesium 1.8 meq/L mmol/L 0.88 [] 1.800
110 HCT Hematocrit 38.9 %25 L/L 0.39 [] 0.389
132 MG Magnesium 1.5 meq/L mmol/L 0.76 [] 1.500
170 HCT Hematocrit 37.6 %25 L/L 0.38 [] 0.376
192 MG Magnesium 1.5 meq/L mmol/L 0.76 [] 1.500
216 HCT Hematocrit 42.9 %25 L/L 0.43 [] 0.429
238 MG Magnesium 1.5 meq/L mmol/L 0.77 [] 1.500
273 HCT Hematocrit 41.4 %25 L/L 0.41 [] 0.414
295 MG Magnesium 1.7 meq/L mmol/L 0.86 [] 1.700
341 MG Magnesium 1.7 meq/L mmol/L 0.83 [] 1.700
380 HCT Hematocrit 44.5 %25 L/L 0.45 [] 0.445
402 MG Magnesium 1.6 meq/L mmol/L 0.81 [] 1.600
429 HCT Hematocrit 41.2 %25 L/L 0.41 [] 0.412
473 MG Magnesium 1.7 meq/L mmol/L 0.84 [] 1.700
489 HCT Hematocrit 41.5 %25 L/L 0.42 [] 0.415
511 MG Magnesium 1.6 meq/L mmol/L 0.80 [] 1.600
534 HCT Hematocrit 45.3 %25 L/L 0.45 [] 0.453
556 MG Magnesium 1.8 meq/L mmol/L 0.92 [] 1.800
590 HCT Hematocrit 44.8 %25 L/L 0.45 [] 0.448
612 MG Magnesium 1.6 meq/L mmol/L 0.82 [] 1.600
646 HCT Hematocrit 45.6 %25 L/L 0.46 [] 0.456
668 MG Magnesium 1.7 meq/L mmol/L 0.86 [] 1.700
713 MG Magnesium 1.7 meq/L mmol/L 0.86 [] 1.700
749 HCT Hematocrit 46.6 %25 L/L 0.47 [] 0.466
771 MG Magnesium 1.7 meq/L mmol/L 0.83 [] 1.700
787 HCT Hematocrit 47.3 %25 L/L 0.47 [] 0.473
803 HCT Hematocrit 40.9 %25 L/L 0.41 [] 0.409
... ... ... ... ... ... ... ... ...
8563 HCT Hematocrit 43.9 %25 L/L 0.44 [] 0.439
8585 MG Magnesium 1.7 meq/L mmol/L 0.83 [] 1.700
8624 HCT Hematocrit 49.1 %25 L/L 0.49 [] 0.491
8646 MG Magnesium 1.7 meq/L mmol/L 0.87 [] 1.700
8683 HCT Hematocrit 34.1 %25 L/L 0.34 [] 0.341
8705 MG Magnesium 1.7 meq/L mmol/L 0.86 [] 1.700
8744 HCT Hematocrit 43.6 %25 L/L 0.44 [] 0.436
8766 MG Magnesium 1.6 meq/L mmol/L 0.81 [] 1.600
8804 HCT Hematocrit 38.1 %25 L/L 0.38 [] 0.381
8826 MG Magnesium 1.7 meq/L mmol/L 0.86 [] 1.700
8865 HCT Hematocrit 45.9 %25 L/L 0.46 [] 0.459
8887 MG Magnesium 2.1 meq/L mmol/L 1.03 [] 2.100
8926 HCT Hematocrit 39.8 %25 L/L 0.40 [] 0.398
8948 MG Magnesium 1.7 meq/L mmol/L 0.83 [] 1.700
8987 HCT Hematocrit 40.6 %25 L/L 0.41 [] 0.406
9009 MG Magnesium 1.6 meq/L mmol/L 0.80 [] 1.600
9050 HCT Hematocrit 46.8 %25 L/L 0.47 [] 0.468
9072 MG Magnesium 1.9 meq/L mmol/L 0.96 [] 1.900
9109 HCT Hematocrit 40.1 %25 L/L 0.40 [] 0.401
9131 MG Magnesium 1.3 meq/L mmol/L 0.67 [] 1.300
9172 HCT Hematocrit 43.9 %25 L/L 0.44 [] 0.439
9194 MG Magnesium 1.8 meq/L mmol/L 0.92 [] 1.800
9241 HCT Hematocrit 37.2 %25 L/L 0.37 [] 0.372
9263 MG Magnesium 1.6 meq/L mmol/L 0.80 [] 1.600
9300 HCT Hematocrit 45.7 %25 L/L 0.46 [] 0.457
9322 MG Magnesium 1.8 meq/L mmol/L 0.92 [] 1.800
9372 HCT Hematocrit 39.9 %25 L/L 0.40 [] 0.399
9394 MG Magnesium 1.7 meq/L mmol/L 0.84 [] 1.700
9434 HCT Hematocrit 38.1 %25 L/L 0.38 [] 0.381
9456 MG Magnesium 1.6 meq/L mmol/L 0.80 [] 1.600

322 rows × 8 columns

In [8]:
issue[(issue['fromucmc'].isnull())]
Out[8]:
LBTESTCD LBTEST LBORRES LBORRESU LBSTRESU LBSTRESN ge fromucmc
19 UREAN Urea Nitrogen 12 mg/dL mmol/L 4.21 [] NaN
20 CREAT Creatinine 0.79 mg/dL umol/L 70.00 [] NaN
21 GLUC Glucose 73 mg/dL mmol/L 4.10 [] NaN
22 CA Calcium 9.3 mg/dL mmol/L 2.34 [] NaN
23 PHOS Phosphate 3.7 mg/dL mmol/L 1.20 [] NaN
27 BILI Bilirubin 0.2 mg/dL umol/L 3.00 [] NaN
28 BILDIR Direct Bilirubin 0.1 mg/dL umol/L NaN [<] NaN
29 BILIND Indirect Bilirubin 0.1 mg/dL umol/L 2.00 [] NaN
32 ALP Alkaline Phosphatase 113 U/L %5BIU%5D/L 113.00 [] NaN
33 URATE Urate 3.8 mg/dL umol/L 226.00 [] NaN
37 T4FR Thyroxine; Free 1.19 ng/dL pmol/L 15.40 [] NaN
38 T3FR Triiodothyronine; Free 3.14 pg/mL pmol/L 4.80 [] NaN
57 OXYCDN Oxycodone NEGATIVE ng/mL ng/mL NaN [] NaN
78 UREAN Urea Nitrogen 10 mg/dL mmol/L 3.61 [] NaN
79 CREAT Creatinine 0.63 mg/dL umol/L 56.00 [] NaN
80 GLUC Glucose 65 mg/dL mmol/L 3.60 [] NaN
81 CA Calcium 9.6 mg/dL mmol/L 2.39 [] NaN
82 PHOS Phosphate 2.6 mg/dL mmol/L 0.84 [] NaN
86 BILI Bilirubin 0.2 mg/dL umol/L 3.00 [] NaN
87 BILDIR Direct Bilirubin 0.1 mg/dL umol/L NaN [<] NaN
88 BILIND Indirect Bilirubin 0.1 mg/dL umol/L 2.00 [] NaN
91 ALP Alkaline Phosphatase 123 U/L %5BIU%5D/L 123.00 [] NaN
92 URATE Urate 3.0 mg/dL umol/L 178.00 [] NaN
95 T4FR Thyroxine; Free 0.82 ng/dL pmol/L 10.50 [] NaN
96 T3FR Triiodothyronine; Free 2.59 pg/mL pmol/L 4.00 [] NaN
127 UREAN Urea Nitrogen 12 mg/dL mmol/L 4.43 [] NaN
128 CREAT Creatinine 0.77 mg/dL umol/L 68.00 [] NaN
129 GLUC Glucose 115 mg/dL mmol/L 6.40 [] NaN
130 CA Calcium 9.0 mg/dL mmol/L 2.26 [] NaN
131 PHOS Phosphate 3.6 mg/dL mmol/L 1.16 [] NaN
... ... ... ... ... ... ... ... ...
9331 URATE Urate 4.2 mg/dL umol/L 250.00 [] NaN
9334 T4FR Thyroxine; Free 1.37 ng/dL pmol/L 17.60 [] NaN
9335 T3FR Triiodothyronine; Free 4.73 pg/mL pmol/L 7.30 [] NaN
9355 OXYCDN Oxycodone NEGATIVE ng/mL ng/mL NaN [] NaN
9389 UREAN Urea Nitrogen 7 mg/dL mmol/L 2.61 [] NaN
9390 CREAT Creatinine 0.81 mg/dL umol/L 72.00 [] NaN
9391 GLUC Glucose 70 mg/dL mmol/L 3.90 [] NaN
9392 CA Calcium 9.1 mg/dL mmol/L 2.28 [] NaN
9393 PHOS Phosphate 2.5 mg/dL mmol/L 0.81 [] NaN
9397 BILI Bilirubin 0.4 mg/dL umol/L 7.00 [] NaN
9398 BILDIR Direct Bilirubin 0.1 mg/dL umol/L 2.00 [] NaN
9399 BILIND Indirect Bilirubin 0.3 mg/dL umol/L 5.00 [] NaN
9402 ALP Alkaline Phosphatase 85 U/L %5BIU%5D/L 85.00 [] NaN
9403 URATE Urate 3.3 mg/dL umol/L 196.00 [] NaN
9407 T4FR Thyroxine; Free 1.00 ng/dL pmol/L 12.90 [] NaN
9408 T3FR Triiodothyronine; Free 3.29 pg/mL pmol/L 5.00 [] NaN
9429 OXYCDN Oxycodone NEGATIVE ng/mL ng/mL NaN [] NaN
9451 UREAN Urea Nitrogen 12 mg/dL mmol/L 4.11 [] NaN
9452 CREAT Creatinine 0.74 mg/dL umol/L 65.00 [] NaN
9453 GLUC Glucose 87 mg/dL mmol/L 4.80 [] NaN
9454 CA Calcium 9.8 mg/dL mmol/L 2.44 [] NaN
9455 PHOS Phosphate 3.9 mg/dL mmol/L 1.26 [] NaN
9459 BILI Bilirubin 0.2 mg/dL umol/L 3.00 [] NaN
9460 BILDIR Direct Bilirubin 0.1 mg/dL umol/L NaN [<] NaN
9461 BILIND Indirect Bilirubin 0.1 mg/dL umol/L 2.00 [] NaN
9464 ALP Alkaline Phosphatase 70 U/L %5BIU%5D/L 70.00 [] NaN
9465 URATE Urate 4.6 mg/dL umol/L 274.00 [] NaN
9469 T4FR Thyroxine; Free 1.13 ng/dL pmol/L 14.60 [] NaN
9470 T3FR Triiodothyronine; Free 3.12 pg/mL pmol/L 4.80 [] NaN
9489 OXYCDN Oxycodone NEGATIVE ng/mL ng/mL NaN [] NaN

2142 rows × 8 columns