EACL'03 Tutorial on Text Representation for Automatic Text Categorization

José María Gómez Hidalgo

Departamento de Inteligencia Artificial (Artificial Intelligence Department)

Universidad Europea de Madrid

Summary - proposal (PDF, ZIP)
Slides of the tutorial (PDF 4 MB, ZIP 1.94 MB)
References (BIB, online searchable bibliography)


The goal of this tutorial is to familiarize the audience with the different ways in which text is represented in Automated Text Categorization (ATC). I define Text Categorization, describe its applications, and present the general model for learning-based ATC. Then I describe a number of works that propose text representations, including the use of statistical and linguistic phrases, Information Extraction patterns, and WordNet information. Application-specific text features based on stylometry and structural text properties are also presented for several tasks, including author, language, and genre identification, spam detection, and others.

The tutorial is mainly divided into two parts: an overview of ATC, and the discussion of specific text representations. In particular, the tutorial covers the following topics:

1. Automated Text Categorization (ATC)

A definition of (Automated) Text Categorization is presented, along with a number of considerations. Some relevant links to this part are:

2. Applications

A number of applications are briefly described in this tutorial. Some interesting sites include:

3. A blueprint for learning-based ATC

Here I present the learning-based model for ATC. Some interesting links include:

4. Advanced document indexing

Some links on the topics covered in this part:

5. Task-oriented features

Some links on the topics covered in this part:

6. Summary

Some links on the topics covered in this part:

Last updated February 28th, 2003