Data mining

Data mining, also known as knowledge discovery in databases (KDD), is the practice of automatically searching large stores of data for patterns. To do this, data mining uses computational techniques from statistics and pattern recognition.

==Definition==
Data mining has been defined as "The nontrivial extraction of implicit, previously unknown, and potentially useful information from data" [1] and "The science of extracting useful information from large data sets or databases" [2]. Although the term is usually used in relation to the analysis of data, data mining, like artificial intelligence, is an umbrella term and is used with varied meanings in a wide range of contexts.

A simple example of data mining is its use in a retail sales department. If a store tracks each customer's purchases and notices that one customer buys a lot of silk shirts, the data mining system associates that customer with silk shirts. The sales department can then act on that information by sending direct mail marketing for silk shirts to that customer. In this case, the data mining system discovered information about the customer that was previously unknown to the company.

==History==
Data mining grew as a direct consequence of the availability of large reservoirs of data. Data collection in digital form was already underway by the 1960s, allowing for retrospective data analysis via computers. Relational databases came into use in the 1980s along with the Structured Query Language (SQL), allowing for dynamic, on-demand analysis of data. The 1990s saw an explosion in the growth of data, and data warehouses began to be used for its storage. Data mining thus arose as a response to the challenges faced by the database community in dealing with massive amounts of data, drawing on statistical analysis and on search techniques from artificial intelligence.
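The retail example from the Definition section can be sketched in a few lines of Python. This is only a minimal illustration, not how any particular retail system works: the purchase log, the customer names, the function name frequent_items and the min_count threshold are all assumptions made up for the example.

```python
from collections import Counter, defaultdict

# Hypothetical purchase log: (customer, product) pairs.
purchases = [
    ("alice", "silk shirt"),
    ("alice", "silk shirt"),
    ("alice", "socks"),
    ("bob", "jeans"),
    ("alice", "silk shirt"),
    ("bob", "silk shirt"),
]


def frequent_items(purchases, min_count=3):
    """Map each customer to the products they bought at least min_count times.

    Customers with no product reaching the threshold are left out.
    """
    # Count how often each customer bought each product.
    counts = defaultdict(Counter)
    for customer, product in purchases:
        counts[customer][product] += 1

    result = {}
    for customer, product_counts in counts.items():
        frequent = [
            (product, n)
            for product, n in product_counts.most_common()
            if n >= min_count
        ]
        if frequent:
            result[customer] = frequent
    return result


if __name__ == "__main__":
    for customer, items in frequent_items(purchases).items():
        print(customer, items)
```

With the sample log above, only the customer who repeatedly bought silk shirts is reported, which is the kind of association the sales department would then use for targeted marketing. A real system would work on far larger data and would normally apply a statistical test rather than a fixed count threshold.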
==Data dredging==
Used in the technical context of data warehousing and analysis, the term data mining is neutral. However, it sometimes has a more pejorative usage that implies imposing patterns (and particularly causal relationships) on data where none exist. This imposition of irrelevant, misleading or trivial attribute correlations is more properly criticized as "data dredging" in the statistical literature.

Used in this latter sense, data dredging implies scanning the data for any relationships and then, when one is found, coming up with an interesting explanation. (This is also referred to as "overfitting the model".) The problem is that large data sets invariably happen to have some exciting relationships peculiar to that data, so any conclusions reached this way are likely to be highly suspect. In spite of this, some exploratory data analysis is always required in any applied statistical analysis to get a feel for the data, so the line between good statistical practice and data dredging is sometimes less than clear.

A more significant danger is finding correlations that do not really exist. Investment analysts appear to be particularly vulnerable to this. "There have always been a considerable number of pathetic people who busy themselves examining the last thousand numbers which have appeared on a roulette wheel, in search of some repeating pattern. Sadly enough, they have usually found it." [3]

Most data mining efforts are focused on developing a fine-grained, highly detailed model of some large data set. Other researchers have described an alternative method that involves finding the minimal differences between elements in a data set, with the goal of developing simpler models that represent relevant data. [4]

==Privacy concerns==
There are also privacy concerns associated with data mining. For example, if an employer has access to medical records, it may screen out people who have diabetes or have had a heart attack.
Screening out such employees will cut costs for insurance, but it creates ethical and legal problems. Data mining of government or commercial data sets for national security or law enforcement purposes has also raised privacy concerns. [5]

There are many legitimate uses of data mining. For example, a database of prescription drugs taken by a group of people could be used to find combinations of drugs with adverse reactions. Since a reaction to a given combination may occur in only 1 out of 1000 people, a single case may not be apparent. A project involving pharmacies could reduce the number of drug reactions and potentially save lives. Unfortunately, there is also a huge potential for abuse of such a database.

In short, data mining gives information that would not be available otherwise, and that information must be properly interpreted to be useful. When the data collected involves individual people, there are many questions concerning privacy, legality, and ethics.

==Combinatorial game data mining==
*Data mining from combinatorial game oracle machines: Since the early 1990s, oracles (also called tablebases) have become available for certain combinatorial games: for example, 3x3 chess from any beginning configuration, small-board dots-and-boxes, small-board hex, and certain endgames in chess, dots-and-boxes, and hex. This has opened up a new area for data mining: the extraction of human-usable strategies from these oracles. This is pattern recognition at too high a level of abstraction for known statistical pattern recognition algorithms, or any other algorithmic approach, to be applied; at least, no one knows how to do it yet (as of January 2005). The method used is the full force of the scientific method: extensive experimentation with the tablebases, combined with intensive study of tablebase answers to well-designed problems and with knowledge of prior art (i.e. pre-tablebase knowledge), leading to flashes of insight. Berlekamp in dots-and-boxes etc.
and John Nunn in chess endgames are notable examples of people doing this work, though they were not and are not involved in tablebase generation.

==See also==
*Artificial intelligence
*Artificial neural network
*Business intelligence
*Business performance management
*Database
*Data stream mining
*Data warehouse
*Decision tree
*Descriptive statistics
*Document warehouse
*Fuzzy logic
*Hypothesis testing
*Linear discriminant analysis
*Logit (in reference to logistic regression)
*Loyalty card
*Machine learning
*Nearest neighbor (pattern recognition)
*Pattern recognition
*Principal components analysis
*Regression analysis
*Relational data mining
*Statistics
*Text mining

==References==
Endnotes:

[1] W. Frawley, G. Piatetsky-Shapiro and C. Matheus, Knowledge Discovery in Databases: An Overview. AI Magazine, Fall 1992, pgs 213-228.

[2] D. Hand, H. Mannila and P. Smyth, Principles of Data Mining. MIT Press, Cambridge, MA, 2001. ISBN 0-262-08290-X

[3] Fred Schwed, Jr, Where Are the Customers' Yachts? ISBN 0471119792 (1940).

[4] T. Menzies and Y. Hu, Data Mining For Very Busy People. IEEE Computer, October 2003, pgs 18-25.

[5] K. A. Taipale, [http://ssrn.com/abstract=546782 Data Mining and Domestic Security: Connecting the Dots to Make Sense of Data], [http://www.advancedstudies.org/ Center for Advanced Studies in Science and Technology Policy]. [http://www.stlr.org/cite.cgi?volume=5&article=2 5 Colum. Sci. & Tech. L. Rev. 2] (December 2003).

Other:
* Jiawei Han and Micheline Kamber, ''Data Mining: Concepts and Techniques'' (2001), ISBN 1-55860-489-8
* Ruby Kennedy et al., ''Solving Data Mining Problems Through Pattern Recognition'' (1998), ISBN 0-13-095083-1
* O. Maimon and M. Last, Knowledge Discovery and Data Mining – The Info-Fuzzy Network (IFN) Methodology, Kluwer Academic Publishers, Massive Computing Series, 2000.
* Hari Mailvaganam, [http://www.dwreview.com/Data_mining/Future_data_mining.html Future of Data Mining], [http://www.dwreview.com/ (December 2004)]
* Sholom Weiss and Nitin Indurkhya, ''Predictive Data Mining'' (1998), ISBN 1-55860-403-0
* Ian Witten and Eibe Frank, ''Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations'' (2000), ISBN 1-55860-552-5

==External links==
* [http://www.data-mining-guide.net Data Mining Software Guide]
* [http://searchcrm.techtarget.com?offer=Wikipedia SearchCRM.com] Original daily breaking news, white papers, expert advice, webcasts, product reviews and more on data mining.
* [http://www.twocrows.com/about-dm.htm Limited introduction to Data Mining (TwoCrows.com)]
* [http://www.thearling.com Comprehensive data mining white papers and tutorials (thearling.com)]
* [http://www.sqlserverdatamining.com SQLServerDataMining.com] Information and interactive demos on SQL Server 2005 Data Mining
* [http://www.crm2day.com/data_mining/ CRM Today - Data Mining] White papers, articles, presentations and academic papers on data mining
* [http://www.bitpipe.com/rlist/term/Data-Mining.html Data Mining whitepapers, webcasts and case studies]
* [http://www.kdnuggets.com/ KDnuggets] Directory for data mining, knowledge discovery, genomic mining, web mining
* [http://www.siebel.com/business-analytics/software-solutions.shtm Data Mining] Software from Siebel
* [http://www.kmining.com/info_conferences.html Kmining] List of data mining and KDD scientific conferences
* [http://www.data-mining-guide.net Data Mining] Guide
* [http://www.dwreview.com/Data_mining/index.html Data Mining and Data Warehousing] Guide to Data Mining
* [http://www.cs.waikato.ac.nz/ml/weka/ Weka] Open source data mining software written in Java
* [http://www.csse.monash.edu.au/~mgaber/WResources.htm Mining Data Streams Bibliography] Links to papers related to mining data streams, its techniques and applications
* [http://www.ailab.si/orange Orange] Open
source data mining software in C++ and Python

==Commercial solutions==
(Alphabetical)
* [http://www-306.ibm.com/software/data/db2/udb/dwe/ IBM DB2 Universal Database Data Warehouse Editions]
* [http://www.mathworks.com/ MATLAB]
* [http://www.microsoft.com/sql/ Microsoft SQL Server]
* [http://www.saksoft.com Saksoft]
* [http://www.sas.com/technologies/analytics/datamining/ SAS]
* [http://www.cytel.com/XLMiner/ XLMiner by Cytel Software]

[http://www.kdnuggets.com/software/index.html KDnuggets' data mining software list]

==Links to leading data mining companies and products==
* [http://www.kofax.com/index.asp Kofax]
* [http://www.in-q-tel.org/ In-Q-Tel]
* [http://www.novodynamics.com Novo Dynamics Inc.]

Data mining

Very weird --- long introduction, then a table of contents, followed by a brief history, some links and some references. I didn't find this entry academic enough: there aren't enough examples of theory, applications and methods, and it mostly talks about the dangers of data mining, as if it were a dangerous thing. Needs content. --User:Exa 15:53, 20 Apr 2004 (UTC)

I've removed the Motley Fool Foolish Four reference as an example of retrospective data mining. Funnily enough, I worked at the Motley Fool at the time -- the UK site -- and the statement as it stood was factually incorrect. Without wanting to digress too wildly: the Foolish Four was abandoned because, at a time of rather huge gains with practically every other investment strategy (this was when the FTSE 100 had recently hit 7,000 and the Dow 11,000), it was showing a small loss and this was considered embarrassing. Had it in fact been continued, it would likely have outperformed most other strategies (and indeed the market) over the next year or so at least.
It's true that data mining is a particular danger with such "mechanical" investing strategies and I've tried to reflect that in the paragraph I've inserted. User:Mswake 22:43, 26 Feb 2004 (UTC)

Data mining is becoming more than just conventional processing; apparently it is now expanding into other fields, including multimedia data mining. However, I'm not sure whether extended data mining styles and methods can be covered in the data mining article or should be covered in another article. I added a simple example of what data mining offers (8:13 EST 21 Nov. 2004)

----

Except for maybe some very specific subfields, data mining seems to be managementspeak-ish. I mean that in most cases where the word is used, it looks like it is just a synonym for the scientific method in its full generality. Discovering human-useful strategies from oracles in combinatorial games (e.g. from endgame tablebases for chess) [see the addition I've made in the page] is called data mining, when it is called anything at all. (I published a paper once, in the early days of this thing. It was at first rejected, but when I resubmitted it with the only change being the title, into which I inserted the magic term "data mining", the same journal promptly published it.)

----

The article reads well as an introduction to data mining until the line mentioning the A Priori algorithm, at which point the technical level of the article shoots up and the reader is bombarded with terminology unfamiliar to anyone without a mathematical background. For example, no explanation of an ''oracle'' is given (the linked article deals with ancient mythological beings) and the writer makes many other assumptions about the reader's level of knowledge. Perhaps this paragraph could be put in a separate "technical details" section, or at least be rewritten in a less dense (no pun), more patient manner.

As a newcomer to the field, I found the article very lightweight.
Some of the better introductions to the topic are available in the links that are cited.
These materials are based on Wikipedia and licensed under the GNU FDL