Information extraction for biodiversity literature: towards the curation of a dipterocarp database
2024
Gabud, R.S.
Information extraction (IE) underlies many natural language processing (NLP) applications. IE exploits the information hidden in millions of scholarly articles by automatically extracting structured information from unstructured data sources, which can in turn be used to semi-automatically populate databases. In this work, the authors designed new annotation guidelines for capturing concepts pertinent to dipterocarps, such as taxon names, geographic locations, dates, habitat descriptions, authorities and names of herbaria. By applying the guidelines to the manual annotation of textual documents, the authors built two gold standard corpora: 1) the COPIOUS corpus, which contains annotations for named entity recognition (NER) that are relevant to biodiversity occurrence information, and 2) the DipteroMine corpus, which has manual annotations for reproductive condition mentions relevant to dipterocarps in addition to the entity types covered by COPIOUS. Satisfactory agreement and fairly consistent annotations were found between two domain expert annotators for both the COPIOUS and DipteroMine corpora. Additionally, the authors developed named entity recognizers by training Conditional Random Fields (CRF) and Bidirectional Long Short-Term Memory (Bi-LSTM) models on the COPIOUS dataset. The CRF-based NER model achieved an overall F1-score of 71.53% when evaluated on the test set. The Bi-LSTM-based NER model, trained on the COPIOUS corpus, demonstrated an F1-score of 74.58%. The authors then developed relation extraction (RE) methods that could provide fine-grained, descriptive information on the habitats and reproductive conditions of plant species, e.g., dipterocarps, which is crucial in forest restoration and rehabilitation efforts.
The authors devised unsupervised approaches to extract two kinds of relationships: between habitats and their locations, and between reproductive conditions of plant species and the corresponding temporal information. Firstly, the authors handcrafted rules for a traditional rule-based pattern matching approach. The authors then developed a relation extraction (RE) approach building upon transformer models, i.e., the Text-to-Text Transfer Transformer (T5), casting the RE problem as a question answering and natural language inference task. A novel unsupervised hybrid approach that combines the rule-based and transformer-based approaches was then proposed. Evaluation of the hybrid approach on an annotated corpus of biodiversity-focused documents demonstrated F1-scores ranging from 89.61% to 96.75% for reproductive condition - temporal expression relations, and from 85.39% to 89.90% for habitat - geographic location relations. This work on RE shows that even without training models on any domain-specific labeled dataset, relationships between biodiversity concepts can be extracted from literature with satisfactory performance. Furthermore, the authors implemented an information extraction (IE) pipeline consisting of an NER tool and the hybrid RE tool. One of the NER models was applied to automatically annotate geographic location, temporal expression and habitat information contained within sentences. A dictionary-based approach was then used to identify mentions of reproductive conditions in text (e.g., phrases such as 'fruited heavily' and 'mass flowering'). The hybrid RE tool was then used to extract reproductive condition - temporal expression and habitat - geographic location entity pairs.
The IE pipeline was tested on the forestry compendium available in the CABI Digital Library (Centre for Agricultural and Biosciences International), and the results showed that this work enables the enrichment of descriptive information on the reproductive and habitat conditions of species. This work is a step towards enhancing a dipterocarp database with habitat and reproductive condition information extracted from text.
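To make the dictionary-based and rule-based steps of the pipeline more concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a small hypothetical phrase dictionary (only 'fruited heavily' and 'mass flowering' come from the abstract), a simplified hypothetical pattern for temporal expressions, and a single illustrative rule that pairs each reproductive condition mention with the nearest temporal expression in the same sentence. The T5-based component of the hybrid approach is not modelled here.

```python
import re

# Hypothetical phrase dictionary. Only the first two phrases appear in the
# abstract; the others are illustrative assumptions.
REPRODUCTIVE_TERMS = ["fruited heavily", "mass flowering", "flowering", "fruiting"]

# Longest phrases first, so "mass flowering" wins over "flowering".
_terms = sorted(REPRODUCTIVE_TERMS, key=len, reverse=True)
CONDITION_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(t) for t in _terms) + r")\b",
    re.IGNORECASE,
)

# Simplified, assumed temporal pattern: "Month YYYY", "YYYY", or a bare month.
# In the described pipeline an NER model provides these annotations instead.
MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)
TEMPORAL_RE = re.compile(
    rf"\b(?:(?:{MONTHS})\s+)?(?:19|20)\d{{2}}\b|\b(?:{MONTHS})\b"
)


def find_spans(pattern, sentence):
    """Return (start, end, text) for every non-overlapping match."""
    return [(m.start(), m.end(), m.group(0)) for m in pattern.finditer(sentence)]


def pair_relations(sentence):
    """Pair each reproductive condition mention with the nearest temporal
    expression in the same sentence (an illustrative rule, assumed here)."""
    conditions = find_spans(CONDITION_RE, sentence)
    times = find_spans(TEMPORAL_RE, sentence)
    pairs = []
    for c_start, c_end, c_text in conditions:
        if not times:
            continue
        # Character gap between the two spans, whichever comes first.
        nearest = min(
            times,
            key=lambda t: min(abs(t[0] - c_end), abs(c_start - t[1])),
        )
        pairs.append((c_text, nearest[2]))
    return pairs


if __name__ == "__main__":
    print(pair_relations("Mass flowering of Shorea was observed in March 1998."))
```

A nearest-mention rule like this is easy to inspect but brittle across clause boundaries, which is one plausible motivation for combining handcrafted rules with a transformer-based component as the abstract describes.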
Bibliographic information
This bibliographic record was provided by University of the Philippines at Los Baños