What is Text Classification?
Text classification assigns unseen documents to a set of classes based on their textual content. Its main goals are to organize large amounts of (unstructured) data and to make information easier to find.
The requirements vary depending on the application that uses a text classification system, and the choice of an appropriate algorithm has to take them into account. Requirements concerning the setup of the classes may include, for example:
- How many classes are needed?
- Do classes require hierarchical ordering or is a flat structure sufficient?
- Should every document be assigned to one class only or is multi-labelling required?
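The difference between single-label and multi-label assignment can be illustrated with a minimal sketch. It assumes that a classifier has already produced one score per class; the function name `assign_labels`, the example scores and the threshold of 0.5 are hypothetical choices made for illustration only.

```python
def assign_labels(scores, multi_label=False, threshold=0.5):
    """Turn per-class scores into a list of assigned class labels."""
    if multi_label:
        # Multi-label: keep every class whose score reaches the threshold.
        return sorted(c for c, s in scores.items() if s >= threshold)
    # Single-label: keep only the best-scoring class.
    return [max(scores, key=scores.get)]


# Hypothetical scores for one document.
scores = {"sports": 0.7, "politics": 0.6, "science": 0.1}
single = assign_labels(scores)                   # one class only
multi = assign_labels(scores, multi_label=True)  # all classes above the threshold
```

In the single-label case the document ends up in exactly one class, whereas the multi-label case may return several classes or none, depending on the threshold.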
This website gives an overview of text classification, covering algorithms, the evaluation of common approaches and applications of text classification, supplemented by a demo.
How does Text Classification work?
In general, text classification is divided into two phases. The first phase learns the characteristics of each class from a set of training documents. The second phase performs the classification itself, based on the previously learned characteristics.
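The two phases can be sketched as a pair of functions. This is a deliberately simple illustration, not one of the algorithms discussed on this site: it assumes whitespace tokenization, uses raw term counts as the class profile, and scores a document by the share of a profile's term mass that it matches. The names `train` and `classify` and the training sentences are made up for the example.

```python
from collections import Counter


def train(labeled_docs):
    """Phase 1: build a term-count profile per class from (text, label) pairs."""
    profiles = {}
    for text, label in labeled_docs:
        profiles.setdefault(label, Counter()).update(text.lower().split())
    return profiles


def classify(profiles, text):
    """Phase 2: return the class whose profile best matches the document's terms."""
    tokens = text.lower().split()

    def score(label):
        profile = profiles[label]
        # Share of the profile's total term count matched by the document.
        return sum(profile[t] for t in tokens) / sum(profile.values())

    return max(profiles, key=score)


# Hypothetical training documents.
training = [
    ("the team won the match", "sports"),
    ("the striker scored a goal", "sports"),
    ("parliament passed the budget", "politics"),
    ("the minister proposed a new law", "politics"),
]
profiles = train(training)
predicted = classify(profiles, "the goal decided the match")
```

Training is run once, and the resulting profiles are reused for every document classified afterwards.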
A Typical Text Classification Workflow
During training, classes are set up and profiles or rules for each class (or category) are generated. A profile can be understood as a set of characteristics for each category. Typical characteristics are the presence or absence of words or phrases. In most cases, additional techniques are used to define the importance of a term within a document, for example term frequency, term weight and inverse document frequency. These measures address the problem that some terms appear in almost every document, and therefore say little about its class, whereas other terms are rare.
For a more detailed explanation see page 18 in Hierarchical Text Classification using Methods from Machine Learning (Granitzer).
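As a rough illustration of the term weighting described above, the sketch below combines term frequency with inverse document frequency. It assumes one common variant, tf(t, d) * log(N / df(t)), with whitespace tokenization; other weighting schemes exist, and the cited reference may define the measures differently. The function name `tf_idf` and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            # Relative term frequency times inverse document frequency.
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


docs = ["the cat sat", "the dog barked", "the cat purred"]
weights = tf_idf(docs)
```

With these sample documents, "the" occurs in every document and receives a weight of zero, while a term that occurs in only one document, such as "sat", receives the highest weight in its document.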
To extract only the most relevant terms, extensive (linguistic) preprocessing may be applied. In concrete implementations, costs and benefits must be weighed. For example, morphological methods slightly increase classification accuracy at the cost of higher computational resource consumption. The following list gives an overview of the most commonly used steps:
- Part-of-Speech Tagging
- Named Entity Recognition
- Word Sense Disambiguation
- Stop Word Filtering
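Of the steps listed above, stop word filtering is the simplest to show in a few lines. The sketch below assumes English text, a regular-expression tokenizer and a tiny hand-picked stop word list; real systems typically use much larger, language-specific lists. The name `filter_stop_words` is hypothetical.

```python
import re

# A tiny illustrative stop word list (an assumption, not a standard list).
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "is", "to", "for", "on"}


def filter_stop_words(text, stop_words=STOP_WORDS):
    """Lowercase the text, tokenize on runs of letters, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in stop_words]


terms = filter_stop_words(
    "The choice of an algorithm is important for the application."
)
```

The remaining terms carry most of the content of the sentence, which is why this step is often applied before term weights are computed.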
The knowledge base used for classification is built upon the profiles generated during training. Documents can then be passed to the system, which decides which class to assign to each of them. How the decision is made depends on the underlying algorithm.