Author: Ing. Petr Klímek, Ph.D., Fakulta managementu a ekonomiky UTB ve Zlíně, Ústav informatiky a statistiky, e-mail: klimek@fame.utb.cz
1. INTRODUCTION
The main purpose of text mining is to process unstructured (textual) information, extract meaningful numeric indices from the text, and thus make the information contained in the text accessible to the various data mining (statistical and machine learning) algorithms discussed in [5]. Information can be extracted to derive summaries for the words contained in the documents or to compute summaries for the documents based on the words contained in them. Hence, we can analyze words, clusters of words used in documents, etc., or we can analyze documents and determine similarities between them or how they are related to other variables of interest in the data mining project. In the most general terms, text mining will “turn text into numbers” (meaningful indices), which can then be incorporated in other analyses such as predictive data mining projects, the application of unsupervised learning methods (clustering), etc. These methods are described and discussed in [1,3].
2. DOCUMENT REPRESENTATION
The main means of document representation is a vector that has as many components as there are words in the dictionary or in the collection of documents. Each term has its own unique, fixed position. Individual documents are therefore represented by sparse vectors of thousands of values. Each term from the dictionary can be encoded for an individual document in the following ways:
- binary (presence or absence of the term in the document),
- number of appearances of the term in the document,
- using the TFIDF value (term frequency, inverse document frequency):
  TFIDF = n · log(M / m)
where n is the number of appearances of the term in the document, m is the number of appearances of the term in the whole collection, and M is the number of documents in the collection. The main problem is the high dimension of these vectors. How can we lower this dimension, i.e. the number of attributes? One possibility is to use only certain terms; the other is to transform the terms. In both cases, we can use dimension reduction methods well known from pattern recognition.
We can use cluster analysis methods to identify groups of similar input texts, i.e. groups of documents (e.g., texts in which vehicle owners described their new cars). This type of analysis could also be extremely useful in the context of market research studies, for example of new car owners. We can also use factor analysis and principal components and classification analysis.
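The three encodings listed above can be illustrated with a minimal Python sketch. It is not taken from the paper: the function names (tokenize, build_dictionary, encode) and the whitespace tokenizer are hypothetical, and the TFIDF weight is assumed to have the common form n · log(M / m), with m taken as the number of documents that contain the term (the usual convention, which is an assumption here).

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def build_dictionary(documents):
    """Return the sorted list of distinct terms; the index of a term is its fixed position."""
    return sorted({term for doc in documents for term in tokenize(doc)})


def encode(documents, mode="tfidf"):
    """Encode every document as a vector with one component per dictionary term.

    mode is "binary", "count", or "tfidf". The TFIDF weight is assumed to be
    n * log(M / m), where m counts the documents containing the term.
    """
    dictionary = build_dictionary(documents)
    counts = [Counter(tokenize(doc)) for doc in documents]
    M = len(documents)
    # m[term]: number of documents in which the term appears (assumed meaning of m)
    m = {term: sum(1 for c in counts if term in c) for term in dictionary}

    vectors = []
    for c in counts:
        if mode == "binary":
            vec = [1 if c[t] > 0 else 0 for t in dictionary]
        elif mode == "count":
            vec = [c[t] for t in dictionary]
        elif mode == "tfidf":
            vec = [c[t] * math.log(M / m[t]) for t in dictionary]
        else:
            raise ValueError("mode must be 'binary', 'count' or 'tfidf'")
        vectors.append(vec)
    return dictionary, vectors


if __name__ == "__main__":
    docs = ["new car new engine", "car repair", "engine noise"]
    for mode in ("binary", "count", "tfidf"):
        dictionary, vectors = encode(docs, mode)
        print(mode, dictionary, vectors[0])
```

Even in this toy collection most components are zero, which is why such vectors are sparse and why reducing the number of attributes matters for realistic dictionaries.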
3. APPLICATIONS OF TEXT MINING
Unstructured text is very common, and in fact may represent the majority of information available to a particular research or data mining project.
Automatic processing of messages, emails, etc. Another common application for text mining is to aid in the automatic classification of texts. For example, it is possible to “filter” out automatically most undesirable “junk email” based on certain terms or words that are not likely to appear in legitimate messages, but instead identify undesirable electronic mail. In this manner, such messages can automatically be discarded. Such automatic systems for classifying electronic messages can also be useful in applications where messages need to be routed (automatically) to the most appropriate department or agency.
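A minimal sketch of such term-based filtering and routing might look as follows. The term lists, department names, and the threshold are purely hypothetical; a real system would learn them from labeled messages rather than hard-code them.

```python
import re
from collections import Counter

# Hypothetical terms assumed to be unlikely in legitimate messages.
JUNK_TERMS = {"lottery", "winner", "prize"}

# Hypothetical mapping from indicative terms to the department that handles them.
ROUTES = {
    "invoice": "accounting",
    "refund": "accounting",
    "warranty": "service",
    "repair": "service",
}


def tokenize(text):
    """Return the lowercase alphabetic words of the message."""
    return re.findall(r"[a-z]+", text.lower())


def classify_message(text, junk_threshold=2):
    """Return "junk", a department name, or "general" for a message.

    A message is treated as junk when it contains at least junk_threshold
    junk terms; otherwise it is routed to the department whose terms
    occur most often in it.
    """
    words = tokenize(text)
    junk_hits = sum(1 for w in words if w in JUNK_TERMS)
    if junk_hits >= junk_threshold:
        return "junk"
    votes = Counter(ROUTES[w] for w in words if w in ROUTES)
    if votes:
        return votes.most_common(1)[0][0]
    return "general"


if __name__ == "__main__":
    print(classify_message("You are a lottery winner, claim your prize"))
    print(classify_message("Please send the invoice for my refund"))
```

The same term-counting idea serves both purposes: discarding undesirable mail and routing the remaining messages.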
Analyzing open-ended survey responses. In survey research (e.g., marketing), it is not uncommon to include various open-ended questions pertaining to the topic under investigation. The idea is to permit respondents to express their “views” or opinions without constraining them to particular dimensions or a particular response format. This may yield insights into customers’ views and opinions that might otherwise not be discovered when relying solely on structured questionnaires designed by “experts.”
Analyzing warranty or insurance claims, diagnostic interviews, etc. In some business domains, the majority of information is collected in open-ended, textual form. For example, warranty claims or initial medical (patient) interviews can be summarized in brief narratives, or when we take our cars to a service station for repairs, typically, the attendant will write some notes about the problems that we report and what we believe needs to be fixed. Increasingly, those notes are collected electronically, so those types of narratives are readily available for input into text mining algorithms.
Text mining as document search. There is another type of application that is often described and referred to as “text mining” – the automatic search of large numbers of documents based on key words or key phrases. This is the domain of, for example, the popular internet search engines that have been developed over the last decade to provide efficient access to Web pages with certain content. This is obviously an important type of application with many uses in any organization that needs to search very large document repositories based on varying criteria.
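Keyword search of this kind is commonly built on an inverted index, which maps each word to the documents that contain it. The following is a minimal sketch under that assumption; the function names build_index and search are hypothetical, and real search engines add ranking, stemming, and much more.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map every lowercase word to the set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(documents):
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the sorted ids of documents that contain every word of the query."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return []
    result = None
    for word in words:
        postings = index.get(word, set())
        # Intersect the posting sets so that all query words must be present.
        result = postings if result is None else result & postings
    return sorted(result)


if __name__ == "__main__":
    docs = [
        "Text mining turns text into numbers",
        "Web mining follows text mining",
        "Multimedia mining is the next step",
    ]
    index = build_index(docs)
    print(search(index, "text mining"))
```

Because the index is built once, each query only touches the posting sets of its own words instead of scanning every document.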
4. TEXT MINING SOFTWARE TOOLS
Although text mining is still a subject of ongoing research, many commercial software tools for text mining are already available, for example Text Miner from SAS (http://www.sas.com), Intelligent Miner for Text (http://www.software.ibm.com), or Text Analyst from Megaputer Intelligence (http://www.megaputer.com). A list of such tools can be found on the Kddnuggets pages (http://www.kddnuggets.com).
5. FUTURE OVERVIEW
It seems that the next step in the progression from text mining to web mining [2,4] will be multimedia mining, i.e. knowledge discovery from multimedia data (texts, pictures, sounds, videos, etc.). We can expect brand new methods and algorithms, as well as the development of new software tools, in this field of data mining in the future.
REFERENCES
[1] JAŠEK, R. Technology in emergency planning and management and business kontinuity. In: Bezpieczenstwo w administracji i biznesie. Gdynia: WSAIB im. Eugeniusze Kwiatkowskiego w Gdyni, 2007, s. 257-264. ISBN 978-83-918369-2-7.
[2] JAŠEK, R. Konkurenční zpravodajství a jeho význam pro strategické řízení. In: Internet a bezpečnost organizací. VIII. ročník mezinárodní konference. Zlín, 14. března 2006. Zlín: Univerzita Tomáše Bati ve Zlíně, 2006, s. 76. ISBN 80-7318-393-5.
[3] KLÍMEK, P. Data mining a jeho využití. E+M (Ekonomie a management), r. 8, č. 3, s. 128–135. Liberec: HF TU v Liberci, 2005. ISSN 1212-3609.
[4] KLÍMEK, P. Dobývání znalostí z webu – webmining. In: Konference Internet a bezpečnost organizací, sborník anotací (VIII. ročník). Zlín: FaME, 2005, s. 56. ISBN 80-7318-393-5.
[5] KLÍMEK, P. Data mining – klíč ke zvýšení konkurenceschopnosti. In: Konference Internet a bezpečnost organizací, sborník anotací (VIII. ročník). Zlín: FaME, 2006, s. 85. ISBN 80-7318-393-5.
Relevant URLs are given directly in the text.




