A probabilistic description-oriented approach for categorising Web documents
Goevert, Norbert; Prof. Fuhr, Norbert; Prof. Lalmas, Mounia
File area 5553
150.4 KB in one file, last modified on 16.04.1999
| File | Files modified on | Size |
|---|---|---|
| Goevert_etal-99.pdf | 16.04.1999 00:00:00 | 150.4 KB |
The automatic categorisation of web documents is becoming crucial for organising the huge amount of information available on the Internet. We are facing a new challenge because web documents have a rich structure and are highly heterogeneous. Two ways to respond to this challenge are (1) to use a representation of the content of web documents that captures these two characteristics and (2) to use more effective classifiers. Our categorisation approach is based on a probabilistic description-oriented representation of web documents and a probabilistic interpretation of the k-nearest neighbour classifier. With the former, we provide an enhanced document representation that incorporates the structural and heterogeneous nature of web documents. With the latter, we provide a theoretically sound justification for the various parameters of the k-nearest neighbour classifier. Experimental results show that (1) using an enhanced representation of web documents is crucial for effective categorisation, and (2) a theoretical interpretation of the k-nearest neighbour classifier yields an improvement over the standard k-nearest neighbour classifier.
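To make the classifier mentioned in the abstract more concrete, the following is a minimal sketch of a generic similarity-weighted k-nearest neighbour categoriser. It is not the probabilistic formulation developed in the paper, which this record does not spell out. The function names (`cosine`, `knn_category_scores`), the sparse term-weight dictionaries, and the choice of cosine similarity are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts (assumed representation)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_category_scores(query, training, k=3):
    """Score categories by normalised similarity votes of the k nearest documents.

    `training` is a list of (term_weights, categories) pairs. The returned
    scores are each category's share of the neighbours' total similarity,
    which is only one simple way to obtain values between 0 and 1.
    """
    # Rank training documents by similarity to the query and keep the top k.
    ranked = sorted(
        ((cosine(query, doc), cats) for doc, cats in training),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    total = sum(sim for sim, _ in ranked)
    if total == 0:
        return {}
    scores = defaultdict(float)
    for sim, cats in ranked:
        for cat in cats:
            scores[cat] += sim / total
    return dict(scores)


# Hypothetical toy data: three training documents with hand-assigned categories.
training = [
    ({"web": 1.0, "html": 1.0}, {"computing"}),
    ({"web": 1.0, "page": 1.0}, {"computing"}),
    ({"football": 1.0, "goal": 1.0}, {"sport"}),
]
print(knn_category_scores({"web": 1.0, "html": 1.0}, training, k=2))
```

In this sketch the parameters that need tuning are `k` and the similarity function. The abstract states that the paper derives a theoretical justification for such parameters rather than choosing them ad hoc.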
Document type:
Scholarly texts » Article, essay
Faculty / Institute:
Faculty of Engineering » Engineering - Campus Duisburg » Department of Computer Science and Applied Cognitive Science
Dewey Decimal Classification:
000 Computer science, information, general works » 000 Computer science, knowledge, systems
Keywords:
automatic categorisation, web documents
Language:
German
Collection / Status:
E-publications / Document published
Document created on:
16.04.1999
Files modified on:
16.04.1999
Media type:
Text
