Named Entity Recognition for Classifying Technoscientific Persons: Combining Pre-trained Language Models and Silver Standard Datasets
Abstract
Research question: While generic named entity recognition (NER) models perform well on general tasks, custom NER models can provide more efficient and accurate solutions for specific domains. This chapter proposes a custom NER model that classifies technoscientific persons according to their professional expertise. Previous efforts to identify occupations have been limited by the absence of precise annotation guidelines and reliable Gold Standard corpora.

Methodology: The chapter addresses this challenge with a hybrid method. The method combines rule-based and dictionary-based approaches to capture domain-specific knowledge and to annotate data automatically. Bootstrapping is employed to improve the generalizability of the model and reduce overfitting: training the model on different variations of the data and testing it on new validation sets allows a more robust evaluation of its performance. Finally, the efficiency and accuracy of the NER model are improved through transfer learning with RoBERTa.

Findings: The first model, trained on the initial subcorpus, produced accurate results for almost all categories. When it was validated on the next subcorpus, however, its performance declined dramatically, which implies overfitting. To address this, bootstrapping was applied by cumulatively adding further subcorpora and reviewing and correcting their annotations. The model was retrained at each step until it reached a satisfactory level of performance. The final model performs well on all categories except social sciences, environmental sciences, and life sciences.

Significance: The proposed approach offers several benefits, including more efficient use of resources, improved accuracy, generalizability, and scalability.

© The Editor(s) (if applicable) and The Author(s), under exclusive licence to Springer Nature Switzerland AG 2024.
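The dictionary-based part of the automatic annotation described in the methodology can be illustrated with a minimal Python sketch. It is not the chapter's implementation: the dictionary entries, the category labels, and the function names `build_pattern` and `silver_annotate` are all hypothetical, chosen only to show how a term list might yield silver-standard character spans for later model training.

```python
import re

# Hypothetical occupation dictionary: surface form -> category label.
# Terms and label names are illustrative assumptions, not the chapter's data.
OCCUPATION_DICT = {
    "physicist": "NATURAL_SCIENCES",
    "biologist": "LIFE_SCIENCES",
    "sociologist": "SOCIAL_SCIENCES",
    "civil engineer": "ENGINEERING",
    "engineer": "ENGINEERING",
}


def build_pattern(terms):
    """Compile one case-insensitive regex that matches any dictionary term."""
    # Longest terms first, so "civil engineer" is preferred over "engineer".
    ordered = sorted(terms, key=len, reverse=True)
    alternation = "|".join(re.escape(term) for term in ordered)
    return re.compile(r"\b(" + alternation + r")\b", re.IGNORECASE)


PATTERN = build_pattern(OCCUPATION_DICT)


def silver_annotate(text):
    """Return (start, end, label) character spans for every dictionary match."""
    return [
        (m.start(), m.end(), OCCUPATION_DICT[m.group(1).lower()])
        for m in PATTERN.finditer(text)
    ]


if __name__ == "__main__":
    sample = "She trained as a civil engineer before working as a physicist."
    for start, end, label in silver_annotate(sample):
        print(sample[start:end], label)
```

Spans produced this way are noisy by construction, which is consistent with the abstract's emphasis on reviewing and correcting annotations at each bootstrapping step before retraining.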











