Sponsors: China Coal Research Institute Co., Ltd.; Academic Journal Working Committee of the China Coal Society
  • Title

    A Clinical Event Extraction Method Based on a High-confidence Pseudo-label Data Selection Algorithm

  • Authors

    LUO Yuanyuan; YANG Chunming; LI Bo; ZHANG Hui; ZHAO Xujian

  • Organization
    School of Computer and Software, Chengdu Neusoft Institute of Information
    School of Computer Science and Technology, Southwest University of Science and Technology
    School of Mathematics and Physics, Southwest University of Science and Technology
    Sichuan Big Data and Intelligent System Engineering Technology Research Center
  • Abstract

    【Purposes】 Event extraction is a prerequisite for building high-quality event knowledge graphs. In clinical event extraction, event elements depend on one another, and existing methods fail to identify these elements accurately and combine them into events. In addition, little labeled clinical event data is available. Both problems make the extraction task very challenging. 【Methods】 Clinical event extraction is modelled as an entity recognition task, and a Chinese medical event extraction method that incorporates multiple features, BERT-MCRF, is proposed. The method uses Bidirectional Encoder Representations from Transformers (BERT) for the embedding and feature extraction parts of the model and adds multi-character sliding-window features to the Conditional Random Field (CRF) layer. BERT-MCRF then serves as the base model for semi-supervised experiments, in which a high-confidence pseudo-label data selection algorithm is proposed as the criterion for filtering data. The 300 higher-quality samples selected in this way are merged with the original data to build a corpus of 1,700 samples, on which the model is retrained. 【Findings】 The overall F1 value of the BERT-MCRF model on the three attribute entities reaches 80.21%, which is 15.11% higher than that of the classical Bi-directional Long Short-Term Memory-Conditional Random Field (BiLSTM-CRF) model. The model retrained with the semi-supervised approach reaches a final F1 value of 81.56%, which is 1.35 percentage points higher than the original BERT-MCRF.
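    The abstract does not spell out how the selection algorithm scores a pseudo-labeled sentence, so the following is only a minimal sketch of the general idea under stated assumptions. It assumes the base model returns, for each character, the probability of its predicted tag, and that a sentence's confidence is the lowest of those probabilities. The `Prediction` container, the `0.9` threshold, the tag names, and all function names are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Prediction:
    """Model output for one unlabeled sentence (hypothetical container)."""
    text: str
    labels: List[str]         # predicted tag for each character
    token_probs: List[float]  # assumed: probability of each predicted tag


def sequence_confidence(token_probs: Sequence[float]) -> float:
    # Assumption: a sentence is only as reliable as its least certain tag.
    return min(token_probs) if token_probs else 0.0


def select_high_confidence(
    predictions: Sequence[Prediction],
    threshold: float = 0.9,  # hypothetical cut-off, not from the paper
    max_samples: int = 300,  # the abstract reports 300 selected samples
) -> List[Prediction]:
    """Keep sentences at or above the threshold, most confident first."""
    scored = [(sequence_confidence(p.token_probs), p) for p in predictions]
    kept = [(score, p) for score, p in scored if score >= threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [p for _, p in kept[:max_samples]]


def merge_corpus(
    labeled: Sequence[Tuple[str, List[str]]],
    selected: Sequence[Prediction],
) -> List[Tuple[str, List[str]]]:
    """Append the selected pseudo-labeled sentences to the labeled data."""
    return list(labeled) + [(p.text, p.labels) for p in selected]


# Toy data: the texts and tags below are placeholders for illustration only.
labeled = [("sentence L", ["O", "B-EVT"])]
preds = [
    Prediction("sentence A", ["B-EVT", "I-EVT"], [0.99, 0.97]),
    Prediction("sentence B", ["O", "B-EVT"], [0.99, 0.55]),
    Prediction("sentence C", ["O", "O"], [0.93, 0.95]),
]
selected = select_high_confidence(preds)
corpus = merge_corpus(labeled, selected)  # retrain the model on this corpus
```

    Scoring by the minimum rather than the mean is a design choice of this sketch: a single uncertain tag can break an extracted event, so one low-probability character is enough to reject the sentence.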

  • Keywords

    clinical medical event extraction; entity recognition; multi-features; semi-supervised learning; high-confidence pseudo-label selection algorithm

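  • Foundation
    Key Research and Development Project of the Science and Technology Department of Sichuan Province (2021YFG0031); Sichuan Provincial Research Institutes Scientific and Technological Achievement Transformation Project (22YSZH0021)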
  • Citation
    LUO Yuanyuan, YANG Chunming, LI Bo, et al. A clinical event extraction method based on a high-confidence pseudo-label data selection algorithm[J]. Journal of Taiyuan University of Technology, 2024, 55(1): 204-213.