A Clinical Event Extraction Method Based on a High-confidence Pseudo-label Data Selection Algorithm
LUO Yuanyuan;YANG Chunming;LI Bo;ZHANG Hui;ZHAO Xujian
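School of Computer and Software, Chengdu Neusoft University; School of Computer Science and Technology, Southwest University of Science and Technology; School of Mathematics and Physics, Southwest University of Science and Technology; Sichuan Big Data and Intelligent System Engineering Technology Research Center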
【Purposes】 Event extraction is a prerequisite for building high-quality event knowledge graphs. In clinical event extraction, event elements depend on one another, and existing methods fail to identify these elements accurately and combine them into events. Labeled clinical event data are also scarce, which makes the task considerably harder. 【Methods】 Clinical event extraction is modeled as an entity recognition task, and a multi-feature Chinese medical event extraction method, BERT-MCRF, is proposed. Bidirectional Encoder Representations from Transformers (BERT) provides the embedding and feature extraction parts of the model, and sliding-window features spanning multiple characters are added to the Conditional Random Field (CRF) layer. BERT-MCRF then serves as the base model for semi-supervised experiments, in which a high-confidence pseudo-label data selection algorithm is proposed as the criterion for filtering data. The 300 higher-quality samples obtained in this way are merged with the original data to build a corpus of 1,700 samples, on which the model is retrained. 【Findings】 The BERT-MCRF model reaches an overall F1 score of 80.21% on the three attribute entity types, 15.11% higher than the classical Bidirectional Long Short-Term Memory-Conditional Random Field (BiLSTM-CRF) model. The model retrained with the semi-supervised approach reaches a final F1 score of 81.56%, 1.35 percentage points higher than the original BERT-MCRF.
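The abstract does not spell out the selection criterion, so the following Python sketch only illustrates the general idea of high-confidence pseudo-label selection. It assumes that each unlabeled sentence is scored by the probability of its least certain predicted tag, and that sentences above a threshold are kept up to a fixed budget. The names `Prediction`, `sentence_confidence`, `select_pseudo_labels`, the threshold of 0.9, and the tag strings are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Prediction:
    """One unlabeled sentence with the base model's per-token output."""
    tokens: List[str]   # characters of the sentence
    labels: List[str]   # predicted tag for each token (hypothetical tag set)
    probs: List[float]  # model probability of each predicted tag


def sentence_confidence(pred: Prediction) -> float:
    """Score a sentence by its least certain token (an assumed criterion)."""
    return min(pred.probs) if pred.probs else 0.0


def select_pseudo_labels(preds: Sequence[Prediction],
                         threshold: float = 0.9,
                         max_keep: int = 300) -> List[Prediction]:
    """Keep at most max_keep predictions whose confidence is >= threshold."""
    confident = [p for p in preds if sentence_confidence(p) >= threshold]
    # Most confident sentences first, so the budget keeps the best ones.
    confident.sort(key=sentence_confidence, reverse=True)
    return confident[:max_keep]


if __name__ == "__main__":
    # Toy pool of base-model predictions on unlabeled text (made-up values).
    pool = [
        Prediction(["a", "b"], ["B-ENT", "I-ENT"], [0.99, 0.95]),
        Prediction(["c", "d"], ["B-ENT", "I-ENT"], [0.99, 0.50]),
        Prediction(["e"], ["O"], [0.97]),
    ]
    selected = select_pseudo_labels(pool, threshold=0.9, max_keep=300)
    # The selected sentences would be merged with the labeled set
    # before retraining the base model.
    print(len(selected))
```

In a pipeline like the one described, the selected sentences and their predicted tags would be appended to the original labeled data, and the base model would then be retrained on the merged corpus. Scoring by the minimum token probability is a deliberately strict choice: a single uncertain tag rejects the whole sentence.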
clinical medical event extraction; entity recognition; multi-features; semi-supervised learning; high-confidence pseudo-label selection algorithm
Sponsors: China Coal Research Institute Co., Ltd.; Academic Journal Working Committee of the China Coal Society