ECML/PKDD'02 Tutorial on Text Mining and Internet Content Filtering

José María Gómez Hidalgo

Departamento de Inteligencia Artificial

Universidad Europea de Madrid


Slides of the tutorial


In recent years, we have witnessed an impressive growth in the availability of information in electronic format, mostly in the form of text, due to the Internet and the increasing number and size of digital and corporate libraries. This overwhelming amount of text is hard for an average person to consume, which leads to an information overload problem. Just as traditional Data Mining (or more properly, Knowledge Discovery in Databases, KDD) is about finding patterns in data, Text Data Mining (Text Mining, TM for short) is about uncovering patterns in data when the data is text. In other words, the goal of TM is to turn the information buried in text into valuable knowledge that alleviates information overload.

TM is an emerging research and development field that addresses the information overload problem by borrowing techniques from data mining, machine learning, information retrieval, natural-language understanding, case-based reasoning, statistics, and knowledge management to help people gain rapid insight into large quantities of semi-structured or unstructured text. TM includes several text processing and classification techniques, such as text categorization, clustering and retrieval, and information extraction, but it also involves the development of new methods for analyzing, digesting and presenting information.

A prototypical application of TM techniques is Internet information filtering. The ease of publishing and communicating over the Internet makes it prone to misuse. For instance, websites devoted to pornography, racism, terrorism, etc. are accessed daily by impressionable minors. Also, Internet email users have to bear intrusive unsolicited bulk email, which makes email less valuable and more expensive as a means of communication. Internet filtering through TM techniques is a promising field of work that will provide the Internet community with more accurate and cheaper systems for limiting young people's access to illegal and offensive Internet content, and for alleviating the unsolicited bulk email problem.


Outline

The goal of this tutorial is to familiarize the audience with the emerging area of Text Mining in a practical way. This goal will be achieved by illustrating the concepts of the field through two Text Categorization applications focused on Internet information filtering: the detection of offensive websites, and the detection of unsolicited bulk email. Because they are relatively simple, these applications will allow the audience to understand the main topics in Text Mining.

The tutorial is divided into two main parts. The first part is an overview of TM topics, focusing on the specific problems of TM in relation to KDD. The concepts will be organized by classification task, and a number of supervised and unsupervised learning tasks will be reviewed. The second part will illustrate the concepts of TM through a detailed analysis of the two Internet filtering tasks mentioned above. For the detection of offensive websites, an operational system will be built quickly by reusing a number of open-source tools, including the Muffin proxy system and the Waikato Environment for Knowledge Analysis (WEKA) learning library.

In particular, the tutorial will cover the following topics:

1. TM: what is it and what is it not?

This section will cover introductory topics, will state the main specific problems in TM (in relation to KDD), and will include a review of hot Text Mining applications.

2. Learning from text when we know what to learn about

Covering document categorization and filtering; topic detection and tracking; term identification, extraction and categorization, including text representation models, Part-Of-Speech Tagging and Word Sense Disambiguation; shallow parsing; information extraction.

3. Learning from text when we do not know what to learn about

Covering document clustering and term clustering (including Latent Semantic Indexing, automatic thesaurus construction, etc.); discovering relations among documents and terms, and key phrase extraction; document summarization.

4. Tools for TM

Review of available commercial and research tools; the Waikato Environment for Knowledge Analysis.

5. Application to the detection of offensive websites

Including motivation, web page analysis and processing, learning useful regularities among offensive web pages, evaluating detection systems, an operational solution based on open-source software; the POESIA (Public Open source Environment for a Safer Internet Access) project.

6. Application to the detection of unsolicited bulk email

Including motivation, email message analysis and processing, learning useful regularities among unsolicited email messages, evaluating detection systems.

7. Challenges in TM

Addressing exploratory text analysis with the aid of visualization tools for finding relations among facts.
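
To make the filtering applications in topics 5 and 6 more concrete, the following is a minimal Java sketch of text categorization: messages are reduced to a bag of words and a multinomial Naive Bayes model with add-one smoothing picks the most likely class. It is only an illustration under stated assumptions, not the system built in the tutorial, and it does not use WEKA or Muffin. The class name TextFilterSketch, the labels and the training messages are all hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: bag-of-words multinomial Naive Bayes text filter. */
class TextFilterSketch {

    // Token counts per class, total tokens per class, and documents per class.
    private final Map<String, Map<String, Integer>> tokenCounts = new HashMap<>();
    private final Map<String, Integer> totalTokens = new HashMap<>();
    private final Map<String, Integer> docCounts = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    /** Lowercases the text and splits it on anything that is not a-z or 0-9. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Adds one labeled document to the model. */
    void train(String label, String text) {
        docCounts.merge(label, 1, Integer::sum);
        totalDocs++;
        Map<String, Integer> counts =
                tokenCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String token : tokenize(text)) {
            counts.merge(token, 1, Integer::sum);
            totalTokens.merge(label, 1, Integer::sum);
            vocabulary.add(token);
        }
    }

    /** Log prior plus the sum of smoothed log likelihoods of each token. */
    double logScore(String label, String text) {
        double score = Math.log(docCounts.get(label) / (double) totalDocs);
        Map<String, Integer> counts = tokenCounts.get(label);
        int total = totalTokens.getOrDefault(label, 0);
        for (String token : tokenize(text)) {
            int c = counts.getOrDefault(token, 0);
            // Add-one (Laplace) smoothing avoids log(0) for unseen tokens.
            score += Math.log((c + 1.0) / (total + vocabulary.size()));
        }
        return score;
    }

    /** Returns the label with the highest score, or null if untrained. */
    String classify(String text) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : docCounts.keySet()) {
            double s = logScore(label, text);
            if (s > bestScore) {
                bestScore = s;
                best = label;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        TextFilterSketch filter = new TextFilterSketch();
        // Toy training data, invented for this example.
        filter.train("spam", "Win money now, cheap offer, click here");
        filter.train("spam", "Cheap pills, limited offer, win a prize");
        filter.train("legitimate", "Meeting agenda for the project review on Monday");
        filter.train("legitimate", "Please find the tutorial slides attached");
        System.out.println(filter.classify("cheap offer, win money"));
        System.out.println(filter.classify("slides for the project meeting"));
    }
}
```

With this toy data, the first message is classified as spam and the second as legitimate. A realistic filter would need a much larger labeled collection, a held-out test set for evaluation, and probably feature selection; a learning library such as WEKA provides these pieces ready-made.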


Audience and Prerequisites

The tutorial is of interest to both researchers and practitioners of KDD and machine learning (and thus to those attending ECML or PKDD). Researchers will get a practical overview of the TM field from the point of view of an applied, interactive KDD process. Practitioners will get a better understanding of the specific problems of KDD when the data is text, and of their relation to the recurrent problems in KDD.

A basic knowledge of machine learning and KDD is recommended. Familiarity with the Java programming language is helpful.


For more information, please visit the Tutorials home page at the ECML/PKDD'02 website.

José María Gómez Hidalgo

Last updated July 26, 2002