Sam Tomioka
Feb 20, 2019
Verifying scientific units and converting reported units to standard units has always been challenging for Data Science, for several reasons:
1. A lookup table is needed that covers all possible input and output units, the measurement names (e.g. Glucose, Weight, ...), conversion factors, molar weights, etc.
2. The measurement names in the lookup table and in the incoming data must match.
3. The incoming units must be in the lookup table.
4. Maintenance of the lookup table must be kept in sync with standard terminology updates.
5. Careful medical review is required in addition to laborious Data Science review.
and more...
Despite the challenges, the lookup table approach is the norm at many companies for unit verification and conversion. Consideration was given to a more systematic approach that does not require lab test names[1], but some units rely on the molar weight and/or ion valence of the specific lab test, so this approach does not solve the problem. Regulatory agencies require sponsors to use standardized units for reporting and analysis[2]. The PMDA requires SI units for all reporting and analysis[3,4]. The differences in requirements force us to maintain region-specific conversions for some measurements, which adds complexity.
The approach Jozef Aerts discussed uses REST API requests to the Unified Code for Units of Measure (UCUM) resources from the US National Library of Medicine[5]. The obvious benefit is that we could potentially eliminate maintenance of the lab conversion lookup table. Here is how UCUM describes itself:
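To make the dependence on test-specific constants concrete, here is a minimal sketch of the two conversions that a purely unit-based approach cannot do on its own. The function names are hypothetical, and the molar mass of glucose (about 180.16 g/mol) and the valence of calcium (2) are standard chemistry values, not figures taken from the trial data.

```python
def mg_per_dl_to_mmol_per_l(value_mg_dl: float, molar_mass_g_mol: float) -> float:
    """Convert a mass concentration (mg/dL) to a molar concentration (mmol/L)."""
    # mg/dL -> mg/L is a factor of 10; mg/L divided by g/mol gives mmol/L
    return value_mg_dl * 10.0 / molar_mass_g_mol


def meq_per_l_to_mmol_per_l(value_meq_l: float, valence: int) -> float:
    """Convert an equivalent concentration (mEq/L) to a molar concentration (mmol/L)."""
    # 1 mmol of an ion with valence z corresponds to z mEq
    return value_meq_l / abs(valence)


glucose_mmol_l = mg_per_dl_to_mmol_per_l(100.0, 180.16)  # needs the molar mass of glucose
calcium_mmol_l = meq_per_l_to_mmol_per_l(5.0, 2)         # needs the valence of Ca2+
print(round(glucose_mmol_l, 2))  # 5.55
print(calcium_mmol_l)            # 2.5
```

Both functions need a constant that depends on the analyte, which is why the test name cannot be dropped from the conversion logic.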
The Unified Code for Units of Measure (UCUM) is a code system intended to include all units of measures being contemporarily used in international science, engineering, and business. The purpose is to facilitate unambiguous electronic communication of quantities together with their units. The focus is on electronic communication, as opposed to communication between humans. A typical application of The Unified Code for Units of Measure are electronic data interchange (EDI) protocols, but there is nothing that prevents it from being used in other types of machine communication.
The UCUM is the ISO 11240 compliant standard. FDA uses the UCUM syntax standard for dosage strength in both content of product labeling and drug establishment registration and drug listing.
Something to note:
The units used in an API call have to be compliant with the UCUM specification. In addition, some special characters must be URL-encoded (percent-encoded). In this quick proof of concept, regular expressions are used to convert the input units to UCUM-compliant units.
A total of 6458 laboratory records were used to test the UCUM REST API. These records are from an ongoing clinical trial with a standard set of clinical laboratory tests. Out of 6458 records, 321 were identified as incorrect conversions. Out of 322 findings, 169 were false positives, caused by not accounting for ion valence in mEq to molar unit conversions.
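The two steps described here, rewriting a reported unit into UCUM syntax and then URL-encoding it, can be sketched as follows. This is only an illustration: `UNIT_FIXES`, `to_ucum`, and `build_url` are hypothetical names, the substitution list is a small subset, and leaving `/` and `*` unencoded mirrors the URLs built in this notebook rather than a documented requirement of the service.

```python
import re
from urllib.parse import quote

BASE_URL = "https://ucum.nlm.nih.gov/ucum-service/v1/ucumtransform"

# (regex, replacement) pairs applied in order; an illustrative subset only
UNIT_FIXES = [
    (r"\A[xX]10E", "10*"),  # x10E9/L -> 10*9/L
    (r"IU", "[IU]"),        # international units are written [IU] in UCUM
    (r"Eq", "eq"),          # mEq/L -> meq/L
]


def to_ucum(unit: str) -> str:
    """Rewrite a reported unit into UCUM syntax (not yet URL-encoded)."""
    for pattern, replacement in UNIT_FIXES:
        unit = re.sub(pattern, replacement, unit)
    return unit


def build_url(value, source_unit: str, target_unit: str) -> str:
    """Build the request URL, percent-encoding everything except '/' and '*'."""
    src = quote(to_ucum(source_unit), safe="/*")
    tgt = quote(to_ucum(target_unit), safe="/*")
    return f"{BASE_URL}/{value}/from/{src}/to/{tgt}"


print(build_url(140, "mEq/L", "mmol/L"))
print(build_url(35, "IU/L", "U/L"))
```

Separating the UCUM rewrite from the encoding step keeps the substitution list readable, because the patterns then contain `[IU]` or `%` rather than their encoded forms.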
Result | Records |
---|---|
Total Records | 6458 |
Identified as incorrect conversion | 321 |
True Positive | 153 |
False Positive | 169 |
2142 records were identified as errors. Of these, 120 were errors because the result was categorical even though a unit was given. In the other 2022 records, the source and target units do not have the same property. Most of these are caused by the lack of mass-mol conversions; the rest appeared to be correct, but medical judgement would be necessary.
Type of Error | Records |
---|---|
ERROR: unexpected result: Error: Source and Target unit do not seem to belong to the same property | 2022 |
ERROR: unexpected result: NEGATIVE is not a numeric value | 119 |
ERROR: unexpected result: Negative is not a numeric value | 1 |
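The categorical results could be screened out before any request is sent. A minimal sketch, assuming the original result sits in `LBORRES` as text; the three rows below are made up for illustration and are not taken from the trial data.

```python
import pandas as pd

# Hypothetical sample rows; only the column names come from the lab dataset
sample = pd.DataFrame({
    "LBTESTCD": ["GLUC", "KETONES", "PROT"],
    "LBORRES": ["95", "NEGATIVE", "Negative"],
    "LBORRESU": ["mg/dL", "mg/dL", "mg/dL"],
})

# Values that cannot be parsed as numbers become NaN
numeric = pd.to_numeric(sample["LBORRES"], errors="coerce")
is_categorical = numeric.isna() & sample["LBORRES"].notna()

to_review = sample[is_categorical]    # categorical result reported with a unit
to_convert = sample[~is_categorical]  # safe to send to the conversion service
print(len(to_review), len(to_convert))  # 2 1
```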
Overall, this approach worked for the majority of the 6458 records; however, a few improvements by NLM/NIH are required to fully utilize this REST API.
*According to Paul Lynch (NIH/NLM/LHC), the addition of mass-mol conversions to the ucum-lhc library is nearly complete on their end.
This approach has potential and can be used for any units (PK, Lab, ECG, Vital Signs, etc.). With the addition of mass-mol conversion in the near future, we could replace the current lab conversion lookup. However, the initial implementation of mass-mol conversion may require us to supply the molar weight in the API request, which would prevent us from completely getting rid of the existing lookup table. Additional verifications on lab data, such as unique LBCAT-LBTEST-LBTESTCD combinations, will be required separately. Addition of LOINC is thus strongly desired.
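The separate LBCAT-LBTEST-LBTESTCD check mentioned above could look like the sketch below, which flags any test code that appears with more than one test name or category. The rows are invented for illustration; only the column names come from the lab dataset.

```python
import pandas as pd

# Hypothetical rows; GLUC is deliberately given two different test names
lab = pd.DataFrame({
    "LBCAT": ["CHEMISTRY", "CHEMISTRY", "CHEMISTRY", "HEMATOLOGY"],
    "LBTESTCD": ["GLUC", "GLUC", "ALT", "HGB"],
    "LBTEST": ["Glucose", "Glucose, Serum", "Alanine Aminotransferase", "Hemoglobin"],
})

# Count the distinct names and categories seen for each test code
counts = lab.groupby("LBTESTCD")[["LBTEST", "LBCAT"]].nunique()
inconsistent = counts[(counts["LBTEST"] > 1) | (counts["LBCAT"] > 1)]
print(inconsistent.index.tolist())  # ['GLUC']
```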
[1] Wu and Wales (2017). Laboratory Data Standardization with SAS. PharmaSUG
[2] FDA (2013). Position on Use of SI Units for Lab Tests - FDA
[3] PMDA (2015). Notification on Practical Operations of Electronic Study Data Submissions
[4] PMDA (2017). FAQs on Electronic Study Data Submission (Excerpt)
[5] Jozef Aerts (2019). SDTM --STRESN: why we need UCUM
```python
import boto3
import botocore
import re
import os
import pandas as pd
import numpy as np
import urllib.request  # `import urllib` alone does not load the request submodule
import xml.etree.ElementTree as ET

bucket = 'snvn-sagemaker-1'  # data bucket
s3 = boto3.resource('s3')
KEY = 'mldata/Sam/data/093-701/lb_q2.sas7bdat'
try:
    s3.Bucket(bucket).download_file(KEY, 'data/raw_lb.sas7bdat')
except botocore.exceptions.ClientError as e:
    if e.response['Error']['Code'] == "404":
        print("The object does not exist.")
    else:
        raise
```
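```python
rlb = pd.read_sas('data/raw_lb.sas7bdat', encoding='latin')
df = rlb[['LBTESTCD', 'LBTEST', 'LBORRES', 'LBORRESU', 'LBSTRESU', 'LBSTRESN']]
# Keep records whose original unit differs from the standard unit
# (compare the two columns, not the column against the literal string 'LBSTRESU')
df = df[df['LBORRESU'] != df['LBSTRESU']]
df = df.dropna(axis=0, subset=['LBORRESU']).copy()
# Flag results reported as "<x", then strip the "<" so the value can be sent as a number
df['ge'] = df['LBORRES'].str.findall(r'<')
df['LBORRES'] = df['LBORRES'].str.replace('<', '', regex=False)
df.shape
```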
Get unique units and convert them to UCUM
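```python
# Regular expressions --- update these based on the raw data
patterns = [("%", "%25"),           # URL-encode the percent sign
            (r"\A[xX]10E", "10*"),  # x10E9 -> 10*9
            ("IU", "%5BIU%5D"),     # IU -> [IU], URL-encoded
            (r"\Anan", ""),         # str(NaN) -> empty string
            ("Eq", "eq"),
            ("V/V", "L/L")
            ]
```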
```python
def convert_unit(df, patterns):
    original = pd.unique(df[['LBORRESU', 'LBSTRESU']].values.ravel('K'))
    original = original.tolist()
    #print("units in dataset: ", original)
    ucum = original

    def cleanlist(lst, regex, substitution):
        cleaned_list = [re.sub(regex, substitution, str(line)) for line in lst]
        return cleaned_list

    # Apply the substitutions in order to get UCUM-compliant, URL-encoded units
    for x, sub in patterns:
        ucum = cleanlist(ucum, x, sub)
    b = dict(zip(original, ucum))
    df['LBORRESU'] = df['LBORRESU'].map(b)
    df['LBSTRESU'] = df['LBSTRESU'].map(b)

    # Annotate bare percent units with the test code, e.g. %{HCT}.
    # At this point "%" is already encoded as "%25"; "{" and "}" are encoded as %7B and %7D.
    mask = df['LBORRESU'] == '%25'
    df.loc[mask, 'LBORRESU'] = '%25%7B' + df['LBTESTCD'] + '%7D'
    mask = df['LBSTRESU'] == '%25'
    df.loc[mask, 'LBSTRESU'] = '%25%7B' + df['LBTESTCD'] + '%7D'

    # One request URL per record: <value>/from/<source unit>/to/<target unit>
    base = "https://ucum.nlm.nih.gov/ucum-service/v1/ucumtransform/"
    checklist = []
    for value, source, target in zip(df['LBORRES'], df['LBORRESU'], df['LBSTRESU']):
        checklist.append(base + str(value) + "/from/" + str(source) + "/to/" + str(target))

    response = []
    for url in checklist:
        #print(url)
        with urllib.request.urlopen(url) as res:
            context = ET.fromstring(res.read())
        tmp1 = []
        for child in context:
            if child.text is not None:
                # Child carries plain text (such as an error message): no converted value
                tmp1 = [child.text, float('NaN'), float('NaN'), float('NaN')]
            else:
                # Child has nested elements: collect their text; the fourth item is used below
                tmp1 = [element.text for element in child]
        # One entry per record, taken from the last child element
        response.append(tmp1)

    fromucmc = [float(response[x][3]) for x in range(len(response))]
    rawdata = df['LBSTRESN'].tolist()
    df['fromucmc'] = fromucmc
    # NaN != x is always True, so records without a converted value are flagged as well
    check = [i != j for i, j in zip(fromucmc, rawdata)]
    return df[check]
```
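```python
issue = convert_unit(df, patterns)
# Records where the service returned a value that differs from LBSTRESN
issue[issue['fromucmc'].notnull()]
# Records where the service returned an error instead of a value
issue[issue['fromucmc'].isnull()]
```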
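The comparison above uses exact floating-point inequality, so a converted value that differs from `LBSTRESN` only by rounding is also flagged. One possible refinement, which was not part of the analysis reported here, is to compare with a tolerance and to label failed conversions separately. The rows and the relative tolerance of `1e-3` below are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical values: a rounding difference, an exact match,
# a genuine mismatch, and a record with no converted value
demo = pd.DataFrame({
    "LBSTRESN": [5.55, 2.5, 7.0, np.nan],
    "fromucmc": [5.5506, 2.5, 70.0, np.nan],
})

converted = (demo["fromucmc"].notna() & demo["LBSTRESN"].notna()).to_numpy()
close = np.isclose(demo["fromucmc"].to_numpy(), demo["LBSTRESN"].to_numpy(), rtol=1e-3)

# Label each record; rows without a converted value are kept apart from mismatches
demo["status"] = np.where(~converted, "no conversion",
                          np.where(close, "match", "mismatch"))
print(demo["status"].tolist())
```

The tolerance would need to be agreed with the reviewers, since the number of reported significant digits varies by test.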