From CCSU's Daniel Larose, who posted this to several AI-related newsgroups: CCSU Launches Online Certificate in Data Mining Central Connecticut State University (CCSU) announces the launching of a Certificate program in data mining, available completely online. The Certificate in Data Mining is the world's first such program to be made available completely online, according to Daniel T. Larose, Ph.D., Associate Professor of Statistics and Program Coordinator for Data Mining at CCSU. The Certificate consists of four undergraduate courses, and may be completed in one year. The thrust of the program is to provide students with practical, hands-on experience with what the MIT Technology Review called one of the ten emerging technologies that will change the world (MIT Technology Review, Jan / Feb 2001). Students will apply such methodologies as decision trees, market basket analysis, neural networks, classification rules, and cluster detection. Students will gain strong exposure to the Clementine data mining software suite from SPSS, which is ideally suited to an online program, since student versions are available. The first course in the Certificate sequence begins online in September. Also beginning in September is an online graduate course in data mining, along with online courses in mathematical statistics, experimental design, JAVA, and calculus. Online registration is taking place now at onlinecsu.ctstateu.edu. For more information, visit the Data Mining at CCSU website at www.math.ccsu.edu/dm or contact the Program Coordinator at [email protected]. Cheers, ------------------ Tom Head www.tomhead.net
Some may ask, "What the heck is data mining?" Others may say, "I didn't get that data mining joke." Well, it's not a joke. Let me give you a real-world example that I think is interesting. There's a large retail chain that has a database in the hundreds-of-terabytes range (a terabyte is 1,000 gigabytes). This one database contains the sales detail for all the stores in the chain. Using data mining, they noticed one morning that, the day before, two of their stores had sold a lot more drinking straws than usual, and a lot more than the other stores. The head office called the two stores and discovered that they had moved drinking straws next to the Kool-Aid display because of the sale on Kool-Aid that had just started. By the next day all the stores had moved drinking straws next to the Kool-Aid. By the end of the Kool-Aid sale it was estimated that the increased net on straw sales was something like 2 or 3 million dollars, IIRC. Now you can imagine that their computer system is expensive to buy and maintain, but with results like that, they can't afford not to do data mining.
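To make the straw story a bit more concrete, here is a minimal sketch of the kind of check that could surface those two stores. It is not what the retailer actually ran (the post doesn't say); the function name `flag_unusual_stores`, the 3x threshold, and all the sales figures are made up for illustration. The idea is simply to compare each store's sales yesterday against its own typical daily sales.

```python
from statistics import median

def flag_unusual_stores(yesterday_units, typical_units, ratio_threshold=3.0):
    """Return store IDs whose sales yesterday were far above their own norm.

    yesterday_units: dict of store_id -> units sold yesterday
    typical_units:   dict of store_id -> list of units sold on earlier days
    ratio_threshold: hypothetical cutoff; 3.0 means "three times the median"
    """
    flagged = []
    for store, sold in yesterday_units.items():
        baseline = median(typical_units[store])
        if baseline > 0 and sold / baseline >= ratio_threshold:
            flagged.append(store)
    return sorted(flagged)

# Made-up drinking-straw sales, for illustration only.
history = {
    "store_a": [40, 42, 38, 41],
    "store_b": [55, 50, 52, 53],
    "store_c": [30, 33, 31, 29],
}
yesterday = {"store_a": 160, "store_b": 54, "store_c": 125}

unusual = flag_unusual_stores(yesterday, history)
print(unusual)  # ['store_a', 'store_c']
```

A real system would run something like this across every item in every store, which is where the terabytes and the expensive hardware come in. The flag only says "something changed here"; somebody still had to phone the stores to find out why.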
The term "data mining" has more questionable associations in the field of epidemiology. If you have surfed onto junkscience.com or have read books by Steven J. Milloy, the site's founder, you'll grasp the pejorative use of the term. Since I have an old friend now at MIT teaching statistics, I'll email him and invite his comment.
I can't help but remember a really funny Dilbert cartoon about data mining. Dogbert is a specialist in data mining. He wears a miner's outfit (a big light attached to his forehead) and is looking for some hidden messages from God. I would add "knowledge management" to the same category as "data mining".
That would be interesting, because I don't think there is any "shady" side to data mining; there are shady researchers, however. That seems to be the gist of the anecdotes on the junkscience.com site. It is alleged that some epidemiologists with an agenda (or the EPA as a whole) relaxed the criteria needed to advance a statistical observation to cause-and-effect status. In some anecdotes, some of the epidemiologists committed other mortal research sins as well, including ignoring evidence that did not back their agenda or research. To me, those were not data mining stories ("data dredging," in the pejorative of the website); they were agenda-driven research stories. I didn't read the mentioned book and only surveyed the anecdotes, so the author may have issues with data mining. However, the anecdotal story in another post shows how data mining is used when the only agenda is to discover new, true information from unstructured or variously structured data and make use of it: why drinking straws sold better in some stores than others.
I thought the main point of "data mining" is to essentially spy on people--their personal information, their buying habits, the sites they visit on the Web, etc.
Gathering marketing information is certainly one common goal, but I don't think that analyzing web visitation data meets the definition of data mining. The epidemiology post gives an example of other uses, in this case, looking at habits such as smoking and correlating them with human disease. (The identity of the smokers is irrelevant, although demographic information has value.) Another example might be an expert system that calculates the insurability of an individual or a company based on statistical analysis gleaned from a data mining tool. The reason I think that web visitation analysis is not by itself data mining is that the original purpose of data mining was to provide data analysis tools for information stored in a wide variety of formats: unstructured text, semi-structured text of many types, old ISAM databases, and various new database formats. Frequently, the objects of the data mining were data from legacy data processing systems in corporate archives going back many years, including pre-web years. Analyzing all this data presented a difficult software development task, and tools emerged to handle the various file formats and provide powerful statistical analysis. Looking at web statistics alone is actually a very simple task. The web servers provide nicely structured data that is easily crunched. Of course, this information could be thrown into the general data mining pool and used in combination with other information.
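To illustrate how little work the web-log case takes, here is a rough sketch that counts successful hits per page. It assumes the server writes the widely used Common Log Format; the regular expression, the function name `count_page_hits`, and the sample log lines are all invented for the example.

```python
import re
from collections import Counter

# Common Log Format: host ident user [time] "METHOD path PROTOCOL" status bytes
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+)$'
)

def count_page_hits(lines):
    """Count requests per path, keeping only responses with status 200."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match and match.group("status") == "200":
            hits[match.group("path")] += 1
    return hits

# Invented sample entries (addresses are from a documentation-only range).
sample_log = [
    '192.0.2.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '192.0.2.2 - - [10/Oct/2000:13:56:01 -0700] "GET /courses.html HTTP/1.0" 200 5120',
    '192.0.2.1 - - [10/Oct/2000:13:57:12 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '192.0.2.3 - - [10/Oct/2000:13:58:40 -0700] "GET /missing.html HTTP/1.0" 404 -',
]

page_hits = count_page_hits(sample_log)
print(page_hits.most_common())  # [('/index.html', 2), ('/courses.html', 1)]
```

One regular expression and a counter cover it, because every record has the same shape. Compare that with pulling the same kind of tally out of decades of mixed legacy files, which is the hard part the data mining tools were built for.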
My view of data mining is that it is just one of the more recent CS/IS buzzwords being thrown about. As computer hardware gets cheaper and the ability to collect larger amounts of data from different sources grows, databases are growing very quickly. The ability to do analysis on these large stores of data is also growing quickly, thanks to faster, cheaper hardware and the more powerful software tools being made available. The true cutting edge for these technologies is in business, not in academia. I believe that it moved from academia to business between 10 and 20 years ago, as SQL database systems started being developed. Data mining, OLAP, and data warehousing are well-known concepts that are in widespread use throughout the industry.
One more thought: the questionable associations might come from datacrats who use data mining without having a firm knowledge of statistics and research. A common example is someone who concludes from data mining that obesity is caused by Diet Coke because that is what obese people drink more than anything else. I guess it is a matter of taste whether that is just bad science or data mining gone amok. I guess the name is expanding in meaning. I don't know exactly when the term first appeared, but the concepts go back to about 1989 or 1990 with some "knowledge engineering" folks.
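The diet soda trap is easy to reproduce with a toy simulation. In the sketch below, every number is an assumption chosen for illustration (30% of shoppers obese, 70% of them choosing diet soda versus 20% of everyone else), and the data is randomly generated, not real. Weight is assigned first and the drink choice depends on it, so the drink cannot possibly cause anything, yet the mined association looks damning.

```python
import random

def simulate_shoppers(n=10_000, seed=0):
    """Generate (obese, drinks_diet) pairs where weight drives drink choice."""
    rng = random.Random(seed)
    shoppers = []
    for _ in range(n):
        obese = rng.random() < 0.3           # assumed share of obese shoppers
        p_diet = 0.7 if obese else 0.2       # assumed preference for diet soda
        drinks_diet = rng.random() < p_diet  # the drink never affects weight
        shoppers.append((obese, drinks_diet))
    return shoppers

def obesity_rate(shoppers, drinks_diet):
    """Share of obese shoppers among those with the given drink habit."""
    group = [obese for obese, diet in shoppers if diet == drinks_diet]
    return sum(group) / len(group)

shoppers = simulate_shoppers()
rate_diet = obesity_rate(shoppers, True)
rate_other = obesity_rate(shoppers, False)
print(f"obese among diet soda drinkers: {rate_diet:.2f}")
print(f"obese among everyone else:      {rate_other:.2f}")
```

The obesity rate among diet soda drinkers comes out several times higher than among everyone else, even though the simulation was built so that the causation runs the other way. The tool finds the pattern faithfully; deciding what it means still takes someone who knows statistics and research design.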