TY - JOUR
T1 - Deep learning-based NLP data pipeline for EHR-scanned document information extraction
AU - Hsu, Enshuo
AU - Malagaris, Ioannis
AU - Kuo, Yong Fang
AU - Sultana, Rizwana
AU - Roberts, Kirk
N1 - Publisher Copyright: © 2022 The Author(s). Published by Oxford University Press on behalf of the American Medical Informatics Association.
PY - 2022/7/1
Y1 - 2022/7/1
AB - Objective: Scanned documents in electronic health records (EHRs) have been a challenge for decades and are expected to remain in use for the foreseeable future. Current approaches for processing them include image preprocessing, optical character recognition (OCR), and natural language processing (NLP). However, there is limited work evaluating the interaction of image preprocessing methods, NLP models, and document layout. Materials and Methods: We evaluated 2 key indicators for sleep apnea, the apnea-hypopnea index (AHI) and oxygen saturation (SaO2), from 955 scanned sleep study reports. Image preprocessing methods included gray-scaling, dilation, erosion, and contrast adjustment. OCR was implemented with Tesseract. Seven traditional machine learning models and 3 deep learning models were evaluated. We also evaluated combinations of image preprocessing methods and 2 deep learning architectures (with and without structured input providing document layout information), with the goal of optimizing end-to-end performance. Results: Our proposed method using ClinicalBERT reached an AUROC of 0.9743 and a document accuracy of 94.76% for AHI, and an AUROC of 0.9523 and a document accuracy of 91.61% for SaO2. Discussion: Extracting meaningful information from scanned reports involves multiple interrelated steps. While it would be infeasible to experiment with all possible combinations of options, we experimented with several of the most critical steps for information extraction, including image processing and NLP. Given that scanned documents will likely be part of healthcare for years to come, it is critical to develop NLP systems that extract key information from these data. Conclusion: We demonstrated that the proper use of image preprocessing and document layout information could benefit scanned document processing.
KW - electronic health records
KW - natural language processing
KW - optical character recognition
KW - polysomnography
KW - scanned document
UR - http://www.scopus.com/inward/record.url?scp=85134905733&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85134905733&partnerID=8YFLogxK
U2 - 10.1093/jamiaopen/ooac045
DO - 10.1093/jamiaopen/ooac045
M3 - Article
AN - SCOPUS:85134905733
SN - 2574-2531
VL - 5
JO - JAMIA Open
JF - JAMIA Open
IS - 2
M1 - ooac045
ER -