Data mining


Data mining, developmental history of data mining and knowledge discovery. Technological elements and methods of data mining. Steps in knowledge discovery. Change and deviation detection. Related disciplines, information retrieval and text extraction.



Students KA-81

Bogomazova Yana

Karkunow Yaroslav

Kyiv 2011

Table of Contents



1. What is Data Mining

2. Developmental History of Data Mining and Knowledge Discovery

3. Theoretical Principles

4. Technological Elements of Data Mining

5. Steps in Knowledge Discovery

5.1 Step 1: Task Discovery

5.2 Step 2: Data Discovery

5.3 Step 3: Data Transformation

5.4 Step 4: Data Reduction

5.5 Step 5: Discovering Patterns (aka Data Mining)

5.6 Step 6: Result Interpretation and Visualization

5.7 Step 7: Putting the Knowledge to Use

6. Data Mining Methods

6.1 Classification

6.2 Regression

6.3 Clustering

6.4 Summarization

6.5 Change and Deviation Detection

7. Related Disciplines: Information Retrieval and Text Mining

7.1 Information Retrieval (IR)

7.2 IR Contributions to Data Mining

7.3 Data Mining Contributions to IR

8. Text Mining




Data mining, or knowledge discovery, refers to the process of finding interesting information in large repositories of data. The term data mining also refers to the step in the knowledge discovery process in which special algorithms are employed in the hope of identifying interesting patterns in the data. These patterns are then analyzed, yielding knowledge. The desired outcome of data mining activities is to discover knowledge that is not explicit in the data, and to put that knowledge to use.

Librarians involved in digital libraries are already benefiting from data mining techniques as they explore ways to automatically classify information and explore new approaches for subject clustering (MetaCombine Project). As the field grows, new applications for libraries are likely to evolve and it will be important for library administrators to have a basic understanding of the technology.

A wide variety of data mining techniques are also employed by industry and government. Many of these activities pose threats to personal privacy. Because librarians are professionals ethically bound to safeguard individual privacy, they should monitor data mining activities and keep them on their radar.

This paper is written for information professionals who would like a better understanding of knowledge discovery and data mining techniques. It traces the historical development of this new discipline, explains specific data mining methods, and concludes that future development should focus on tools and techniques that yield useful knowledge without invading individual privacy.


Data mining is an ambiguous term that has been used to refer to the process of finding interesting information in large repositories of data. More precisely, the term refers to the application of special algorithms in a process built upon sound principles from numerous disciplines including statistics, artificial intelligence, machine learning, database science, and information retrieval (Han & Kamber, 2001).

Data mining algorithms are used in pursuits variously called data mining, knowledge mining, data-driven discovery, and deductive learning (Dunham, 2003). These techniques can be applied to a wide variety of data types, including databases, text, spatial data, temporal data, images, and other complex data (Frawley, Piatetsky-Shapiro, & Matheus, 1991; Hearst, 1999; Roddick & Spiliopoulou, 1999; Zaïane, Han, Li, & Hou, 1998).

Some areas of specialty have their own names, such as KDD (knowledge discovery in databases), text mining, and Web mining. Most of these specialties use the same basic toolset, follow the same basic process, and (hopefully) yield the same product: useful knowledge that was not explicitly part of the original data set (Benoît, 2002; Han & Kamber, 2001; Fayyad, Piatetsky-Shapiro, & Smyth, 1996).

1. What is Data Mining


Data mining refers to the process of finding interesting patterns in data that are not explicitly part of the data (Witten & Frank, 2005, p. xxiii). These patterns can tell us something new and can be used to make predictions. The process consists of several steps: selecting the data to analyze, preparing the data, applying the data mining algorithms, and then interpreting and evaluating the results. Sometimes the term data mining refers only to the step in which the algorithms are applied, which has created a fair amount of confusion in the literature. More often, however, the term refers to the entire process of finding and using interesting patterns in data (Benoît, 2002).
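The four steps named above (selecting data, preparing it, applying an algorithm, and interpreting the results) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy records, the function names, the choice of pair counting as the mining algorithm, and the 0.5 support threshold are hypothetical and do not come from the cited sources.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: each record lists the subjects of one checkout.
RAW_RECORDS = [
    "statistics, databases, machine learning",
    "Databases, Statistics",
    "machine learning, statistics, databases",
    "poetry",
    "",  # an empty record that the selection step should drop
]

def select_data(records):
    """Step 1: choose the records to analyze (here: every non-empty one)."""
    return [r for r in records if r.strip()]

def prepare_data(records):
    """Step 2: normalize each record into a set of lowercase subject terms."""
    return [{term.strip().lower() for term in r.split(",")} for r in records]

def mine_patterns(transactions):
    """Step 3: apply a mining algorithm (here: count co-occurring term pairs)."""
    counts = Counter()
    for items in transactions:
        counts.update(combinations(sorted(items), 2))
    return counts

def interpret(counts, n_transactions, min_support=0.5):
    """Step 4: keep only the pairs frequent enough to count as interesting."""
    return {
        pair: count / n_transactions
        for pair, count in counts.items()
        if count / n_transactions >= min_support
    }

transactions = prepare_data(select_data(RAW_RECORDS))
patterns = interpret(mine_patterns(transactions), len(transactions))
print(patterns)
```

In this toy run the pair ("databases", "statistics") appears in three of the four usable records, a relationship that no single record states explicitly.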

Data mining techniques were first applied to databases. A better term for this process is KDD (Knowledge Discovery in Databases). Benoît (2002) offers this definition of KDD (which he refers to as data mining):

Data mining (DM) is a multistaged process of extracting previously unanticipated knowledge from large databases, and applying the results to decision making. Data mining tools detect patterns from the data and infer associations and rules from them. The extracted information may then be applied to prediction or classification models by identifying relations within the data records or between databases. Those patterns and rules can then guide decision making and forecast the effects of those decisions.

Today, data mining usually refers to the process broadly described by Benoît (2002) but without the restriction to databases. It is a “multidisciplinary field drawing work from areas including database technology, artificial intelligence, machine learning, neural networks, statistics, pattern recognition, knowledge-based systems, knowledge acquisition, information retrieval, high-performance computing and data visualization” (Han & Kamber, 2001, p. xix).

Data mining techniques can be applied to a wide variety of data repositories including databases, data warehouses, spatial data, multimedia data, Internet or Web-based data, and complex objects. A more appropriate term for describing the entire process would be knowledge discovery, but unfortunately the term data mining is what has caught on (Andrássoyá & Paralič, 1999).

2. Developmental History of Data Mining and Knowledge Discovery

The building blocks of today's data mining techniques date back to the 1950s, when the work of mathematicians, logicians, and computer scientists combined to create artificial intelligence (AI) and machine learning (Buchanan, 2006).

In the 1960s, AI and statistics practitioners developed new algorithms such as regression analysis, maximum likelihood estimates, neural networks, bias reduction, and linear models of classification (Dunham, 2003, p. 13). The term “data mining” was coined during this decade, but it was used pejoratively to describe the practice of wading through data and finding patterns that had no statistical significance (Fayyad et al., 1996, p. 40).

Also in the 1960s, the field of information retrieval (IR) made its contribution in the form of clustering techniques and similarity measures. At the time these techniques were applied to text documents, but they would later be utilized when mining data in databases and other large, distributed data sets (Dunham, 2003, p. 13). Database systems focus on query and transaction processing of structured data, whereas information retrieval is concerned with the organization and retrieval of information from a large number of text-based documents (Han & Kamber, 2001, p. 428). By the end of the 1960s, information retrieval and database systems were developing in parallel.

In 1971, Gerard Salton published his groundbreaking work on the SMART Information Retrieval System. It represented a new approach to information retrieval based on the algebraic vector space model (VSM). The VSM would prove to be a key ingredient in the data mining toolkit (Dunham, 2003, p. 13).
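To make the idea of a vector space model concrete, the following minimal sketch represents texts as vectors of term counts and ranks documents by the cosine of the angle between each document vector and the query vector. It is an illustration under stated assumptions, not SMART's actual implementation: the raw-count weighting, the whitespace tokenizer, the function names, and the sample documents are all hypothetical.

```python
import math
from collections import Counter

def term_vector(text):
    """Represent a text as a sparse vector of raw term counts."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank(query, documents):
    """Return (score, document) pairs, most similar to the query first."""
    q = term_vector(query)
    scored = [(cosine_similarity(q, term_vector(d)), d) for d in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

# Hypothetical sample collection.
docs = [
    "data mining finds patterns in data",
    "information retrieval ranks text documents",
    "gardening tips for spring",
]
ranked = rank("mining data", docs)
for score, doc in ranked:
    print(f"{score:.2f}  {doc}")
```

Because both queries and documents live in the same vector space, the same similarity measure can also compare documents with each other, which is what allows it to be reused for clustering.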

Throughout the 1970s, 1980s, and 1990s, the confluence of disciplines (AI, IR, statistics, and database systems) plus the availability of fast microcomputers opened up a world of possibilities for retrieving and analyzing data. During this time new programming languages appeared and new computing techniques were developed, including genetic algorithms, EM algorithms, K-Means clustering, and decision tree algorithms (Dunham, 2003, p. 13).

By the start of the 1990s, the term Knowledge Discovery in Databases (KDD) had been coined and the first KDD workshop held (Fayyad, Piatetsky-Shapiro, & Smyth, 1996, p. 40). The huge volume of data available created the need for new techniques for handling massive quantities of information, much of which was located in huge databases.

The 1990s saw the development...