Korpus-adaptive Eigennamenerkennung (Corpus-Adaptive Named Entity Recognition)

File area 16089

1.64 MB in one file, last modified on 31.01.2007

File                     Files modified on      Size
diss_final2007_DS.pdf    31.01.2007 14:49:12    1.64 MB

Named Entity Recognition (NER) is an important step towards the automatic analysis of natural language and is required by a range of natural language applications. The task requires recognizing and classifying proper names and other unique identifiers according to a predefined category system, e.g. the “traditional” categories PERSON, ORGANIZATION (companies, associations) and LOCATION. While most previous work deals with the recognition of these traditional categories in English newspaper texts, the approach presented in this thesis goes beyond that scope. It is motivated in particular by NER tasks that are more challenging than the classical one, such as NER for German texts or the identification of biomedical entities in scientific texts. In addition, the approach addresses the ease of development and the maintainability of NER services by emphasizing the need for “corpus-adaptive” systems, where “corpus-adaptivity” describes how easily a system can be adapted to new tasks and new text corpora. To implement such a corpus-adaptive system, three design guidelines are proposed: (i) the consistent use of machine-learning techniques instead of manually created linguistic rules; (ii) a strictly data-oriented modelling of the phenomena instead of a generalization based on intellectual categories; (iii) the use of automatically extracted knowledge about Named Entities, gained by analysing large amounts of raw text. A prototype was implemented according to these guidelines, and its evaluation shows the feasibility of the approach. The system, originally developed for a German newspaper corpus, could easily be adapted and applied to the extraction of biomedical entities from scientific abstracts written in English, which demonstrates the corpus-adaptivity of the approach. Despite limited resources in comparison with other state-of-the-art systems, the prototype achieved competitive results for some of the categories.
Document type:
Academic theses » Dissertation
Faculty / Institute:
Faculty of Engineering » Computer Science and Applied Cognitive Science » Computer Science » Knowledge-Based and Natural Language Systems
Dewey Decimal Classification:
400 Language » 410 Linguistics » 410 Linguistics
000 Computer science, information and general works » 000 Computer science, knowledge and systems
Keywords:
Computational Linguistics, Natural Language Processing (NLP), Human Language Technology, Text Mining, Machine Learning, Support Vector Machine (SVM)
Contributors:
Prof. Hoeppner, Wolfgang [Supervisor, doctoral advisor]
Prof. Dr. Morik, Katharina [Reviewer]
Prof. apl. Dr. Biehl, Jürgen [Reviewer]
Language:
German
Collection / Status:
Dissertations / Document published
Date of doctorate:
21.12.2006
Document created on:
30.01.2007
Doctoral application submitted on:
12.05.2006
Files modified on:
31.01.2007
Media type:
Text