A probabilistic description-oriented approach for categorising Web documents

Goevert, Norbert; Prof. Dr.-Ing. Fuhr, Norbert; Prof. Lalmas, Mounia

File area 5553

150.4 KB in one file, last modified on 02.08.2013

File list / Details

File | Files modified on | Size
Goevert_etal_99.pdf | 16.04.1999 00:00:00 | 150.4 KB

The automatic categorisation of web documents is becoming crucial for organising the huge amount of information available on the Internet. Web documents pose a new challenge because they have a rich structure and are highly heterogeneous. Two ways to respond to this challenge are (1) to use a representation of the content of web documents that captures these two characteristics and (2) to use more effective classifiers. Our categorisation approach is based on a probabilistic description-oriented representation of web documents and a probabilistic interpretation of the k-nearest neighbour classifier. With the former, we provide an enhanced document representation that incorporates the structural and heterogeneous nature of web documents. With the latter, we provide a theoretically sound justification for the various parameters of the k-nearest neighbour classifier. Experimental results show that (1) using an enhanced representation of web documents is crucial for effective categorisation, and (2) a theoretical interpretation of the k-nearest neighbour classifier yields an improvement over the standard k-nearest neighbour classifier.
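
The abstract does not spell out how the k-nearest neighbour classifier is interpreted probabilistically, so the following is only a minimal sketch of the general idea of turning neighbour similarities into category scores. It assumes cosine similarity over sparse term-weight vectors and a similarity-weighted vote normalised to sum to one; both choices, the function names, and the toy data are illustrative assumptions and not the formulation used in the paper.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_category_scores(query, training, k=3):
    """Score categories by the similarity-weighted share of the k nearest documents.

    training is a list of (term_weight_dict, category) pairs; this layout is
    a hypothetical choice for the example, not taken from the paper.
    """
    ranked = sorted(
        ((cosine(query, doc), cat) for doc, cat in training),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    total = sum(sim for sim, _ in ranked)
    if total == 0.0:
        return {}  # no neighbour shares a term with the query
    scores = defaultdict(float)
    for sim, cat in ranked:
        scores[cat] += sim / total  # normalise so the scores sum to 1
    return dict(scores)


# Toy training set (invented for illustration only).
training = [
    ({"football": 1.0, "league": 1.0}, "sport"),
    ({"football": 1.0, "goal": 1.0}, "sport"),
    ({"election": 1.0, "vote": 1.0}, "politics"),
]
scores = knn_category_scores({"football": 1.0}, training, k=2)
```

Here `scores` maps each category found among the two nearest neighbours to its normalised weight. A standard k-nearest neighbour classifier would instead count neighbours per category without weighting them by similarity.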
Document type:
Scholarly texts » Article, paper
Faculty / Institute:
Faculty of Engineering » Computer Science and Applied Cognitive Science
Dewey Decimal Classification:
000 Computer science, information, general works » 000 Computer science, knowledge, systems
Keywords:
automatic categorisation, web documents
Language:
German
Collection / Status:
E-publications / Document published
Document created on:
16.04.1999
Files modified on:
02.08.2013
Media type:
Text