TY - JOUR
T1 - Recognizing software names in biomedical literature using machine learning
AU - Wei, Qiang
AU - Zhang, Yaoyun
AU - Amith, Muhammad
AU - Lin, Rebecca
AU - Lapeyrolerie, Jenay
AU - Tao, Cui
AU - Xu, Hua
N1 - Publisher Copyright: © The Author(s) 2019.
PY - 2020/3/1
Y1 - 2020/3/1
AB - Software tools are now essential to research and applications in the biomedical domain. However, existing software repositories are mainly built using manual curation, which is time-consuming and unscalable. This study took the initiative to manually annotate software names in 1,120 MEDLINE abstracts and titles and used this corpus to develop and evaluate machine learning–based named entity recognition systems for biomedical software. Specifically, two strategies were proposed for feature engineering: (1) domain knowledge features and (2) unsupervised word representation features of clustered and binarized word embeddings. Our best system achieved an F-measure of 91.79% for recognizing software from titles and an F-measure of 86.35% for recognizing software from both titles and abstracts using inexact matching criteria. We then created a biomedical software catalog with 19,557 entries using the developed system. This study demonstrates the feasibility of using natural language processing methods to automatically build a high-quality software index from biomedical literature.
KW - biomedical literature
KW - biomedical software
KW - biomedical software index
KW - named entity recognition
KW - natural language processing
UR - http://www.scopus.com/inward/record.url?scp=85074012877&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85074012877&partnerID=8YFLogxK
U2 - 10.1177/1460458219869490
DO - 10.1177/1460458219869490
M3 - Article
C2 - 31566474
AN - SCOPUS:85074012877
SN - 1460-4582
VL - 26
SP - 21
EP - 33
JO - Health Informatics Journal
JF - Health Informatics Journal
IS - 1
ER -