Driver, Jan; Heider, Dominik; Hauke, Sascha; Borschbach, Markus; Pyka, Martin:

Hierarchical Text Clustering Based on Independent Component Analysis

In: Proceedings of the Computational Linguistics-Applications Conference
Warsaw, Poland (2011), pp. 17-21
ISBN: 978-83-60810-47-7
Book article / Chapter / Subject: Computer Science
Faculty of Biology
Central Scientific Institutes » Zentrum für Medizinische Biotechnologie (ZMB)
Abstract:
A data-driven classification of text corpora without a priori knowledge of the underlying topics requires the automated detection of similarities between documents in order to form semantic clusters. However, semantic categories that are valuable for human needs exist on different levels of abstraction. Therefore, augmenting semantic features with hierarchical dependencies on other semantic categories is an important way to improve the classification process. In this article, we present a procedure that generates a hierarchical representation of the semantic categories found in a collection of documents. First, an independent component analysis is applied to identify the feature space underlying the documents of the corpus. The number of relevant components is estimated automatically using the Bayesian information criterion. Then, using the mutual information of the feature space, meta-categories are generated that lead to the emergence of hierarchical structures.
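
The abstract does not spell out how mutual information is turned into meta-categories, so the following Python sketch is only one plausible reading and not the authors' procedure. It assumes each category has already been reduced to a binary activation per document (for instance by thresholding an independent component, which is itself an assumption), then repeatedly merges the two categories with the highest mutual information. The function names, the toy categories, and the rule that a meta-category is active when either child is active are all hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import log


def mutual_information(x, y):
    """Mutual information (in nats) between two equal-length discrete sequences."""
    n = len(x)
    joint = Counter(zip(x, y))
    px, py = Counter(x), Counter(y)
    # Sum of p(a, b) * log(p(a, b) / (p(a) * p(b))) over observed value pairs.
    return sum((c / n) * log(c * n / (px[a] * py[b])) for (a, b), c in joint.items())


def build_hierarchy(activations):
    """Greedily merge the pair of categories with the highest mutual information.

    activations maps a category name to a 0/1 list with one entry per document.
    Returns a nested tuple that describes the merge tree.
    """
    nodes = dict(activations)
    while len(nodes) > 1:
        a, b = max(
            combinations(nodes, 2),
            key=lambda pair: mutual_information(nodes[pair[0]], nodes[pair[1]]),
        )
        # Assumed merge rule: the meta-category is active if either child is.
        merged = [max(u, v) for u, v in zip(nodes[a], nodes[b])]
        del nodes[a], nodes[b]
        nodes[(a, b)] = merged
    return next(iter(nodes))


def indicator(active_docs, n_docs=16):
    """Binary activation vector for a set of document indices."""
    return [1 if d in active_docs else 0 for d in range(n_docs)]


# Hypothetical toy corpus of 16 documents and four categories.
activations = {
    "football": indicator({0, 1, 2, 3}),
    "tennis": indicator({0, 1, 2, 4}),
    "stocks": indicator({3, 4, 8, 9}),
    "bonds": indicator({3, 4, 8, 10}),
}
tree = build_hierarchy(activations)
print(tree)
```

On this toy data the two sports categories are grouped together, as are the two finance categories, before the two pairs are joined at the root. Note that mutual information also rewards strong negative dependence, so a real implementation would need to decide whether mutually exclusive categories should be merged.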