Rössler, Marc:

Korpus-adaptive Eigennamenerkennung (Corpus-Adaptive Named Entity Recognition)

Duisburg, Essen (2006), 195 pp.
Dissertation / Subject: General, Miscellaneous
Faculty of Engineering » Computer Science and Applied Cognitive Science » Computer Science » Knowledge-Based and Natural Language Systems
Hoeppner, Wolfgang (doctoral advisor, supervisor)
Morik, Katharina; Biehl, Jürgen (reviewers)
Abstract:
Named Entity Recognition (NER) is an important step towards the automatic analysis of natural language and is required by a range of natural language applications. The task consists of recognizing and classifying proper names and other unique identifiers according to a predefined category system, e.g. the “traditional” categories PERSON, ORGANIZATION (companies, associations) and LOCATION. While most previous work deals with recognizing these traditional categories in English newspaper texts, the approach presented in this thesis goes beyond that scope. It is particularly motivated by NER tasks that are more challenging than the classical one, such as NER for German or the identification of biomedical entities in scientific texts. In addition, the approach addresses the ease of development and maintainability of NER services by emphasizing the need for “corpus-adaptive” systems, where “corpus-adaptivity” describes how easily a system can be adapted to new tasks and new text corpora. To implement such a corpus-adaptive system, three design guidelines are proposed: (i) the consistent use of machine-learning techniques instead of manually created linguistic rules; (ii) a strictly data-oriented modelling of the phenomena instead of a generalization based on intellectual categories; (iii) the use of automatically extracted knowledge about Named Entities, gained by analysing large amounts of raw text. A prototype was implemented according to these guidelines, and its evaluation shows the feasibility of the approach. The system, originally developed for a German newspaper corpus, could easily be adapted and applied to the extraction of biomedical entities from scientific abstracts written in English, which demonstrates the corpus-adaptivity of the approach. Despite limited resources in comparison with other state-of-the-art systems, the prototype achieved competitive results for some of the categories.