Course Description:
Find the useful information hidden in your data! This course surveys computer-intensive methods for
inductive classification and estimation, drawn from
Statistics, Machine Learning, and Data Mining. Dr.
Elder will describe the key inner workings of leading
algorithms, compare their merits, and (briefly)
demonstrate their relative effectiveness on practical
applications. We'll first review classical statistical
techniques, both linear and nonparametric, then
outline the ways in which these basic tools are
modified and combined into powerful modern
methods. The course emphasizes practical advice and
focuses on the essential techniques of Resampling,
Visualization, and Ensembles. Actual scientific and
business examples will illustrate proven techniques
employed by expert analysts. Along the way, we will
discuss the relative strengths and distinctive
properties of the leading commercial software
products for Data Mining.
Course Material:
Attendees will receive the 864-page book, Handbook of Statistical Analysis and Data Mining Applications, by Drs. Nisbet, Elder, and Miner, winner of the 2009 PROSE Award in Mathematics. You will also receive limited-time but fully functioning software from leading vendors, including SAS, SPSS-IBM, StatSoft, and Salford Systems, and will have the option to attend demos of these powerful tools in action.
Relevant Material:
Another book by Drs. John Elder and Giovanni Seni, Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions, came out in February. It teaches why and how to employ this breakthrough method to build powerful analytic models, and illustrates the algorithms with R code. The book is available as a PDF (Morgan & Claypool) and in paperback (Amazon).
Instructor:
John Elder is Chief Scientist of Elder Research Inc., a Data Mining consulting firm in Charlottesville, Virginia. He has over twenty years of experience developing and applying adaptive, data-driven techniques to practical problems - at an engineering consulting firm, an investment management company, Rice University, and the University of Virginia. Dr. Elder has written and spoken widely on pattern discovery topics, is active on statistical and engineering journals and boards, and has authored some influential data mining tools. His practical experience with commercial applications - including credit scoring, direct marketing, sales forecasting, market timing, and fraud detection - helps illustrate the course concepts.
Intended Audience:
Those who work with data and wish to understand
and use recent developments in predictive analytics.
At the conclusion of this course, you should be able
to discern the basic strengths of competing methods
and select the appropriate tools for your applications.
Participants should have experience with computers
and interest in applied statistical techniques. Best of
all, bring a motivating application you wish to solve!
Course Outline:
I. Pattern Discovery: An Overview
- Inducing Models from Data: Benefits and Dangers
- Example Projects from Science and Business
- Characteristics of successful projects
- Leading Software Tools and Vendors
II. Classical Statistical Techniques (brief review)
- Regression
- Discriminant Analysis & Principal Components
- Nearest Neighbors & Kernels
III. Modern Methods
- Neural & Polynomial Networks
- Decision Trees & MARS (Regression Splines)
IV. Key General Tools
- Scientific Visualization: Grand Tour, Projection Pursuit, limitations
- Bootstrapping/Resampling: Essential!
- Bayes Rule
- Optimization: local and global
- Overfit Control: Complexity Penalties, Smoothing, Shrinking, Generalized Degrees of Freedom
V. Data Trouble-Shooting
- Case Diagnostics (Outlying, Influential, Leverage, & Missing points)
- Feature Creation and Selection
VI. Text Mining
- Stemming, Collocation, & Association Networks
- Statistical vs. Language-dependent methods
- "Bag of Words" and Vector Space
- Focused Crawling & Active Learning
VII. Social Network Analysis
- The power of the "network effect"
- Visualization & modeling tools and examples
VIII. Comparing and Combining Algorithms
- Adaptive model structure
- Matching an algorithm to your application
- Experimental test results
- Combining models to improve accuracy
- Bayesian Model Averaging
- Bagging & Boosting
- Why Ensembles work
IX. Top 10 Data Mining Mistakes
- Lack data, Focus on Training, Rely on 1 technique, Ask the wrong question, Listen (only) to the data, Future leakage, Discount pesky cases, Extrapolate, Answer every inquiry, Sample without care, Believe the best model.

A note about the course scope:
Each of the major topics discussed could comprise a semester-long course if presented in full detail! What this (intensive) short course provides is a broad overview of the highlights, drawing connections
between major developments in the diverse fields that
contribute to Predictive Analytics, including cutting-edge
ways to mine text and graphical networks.
Previous participants have found this "big picture" to
be very useful for identifying techniques to use
immediately, as well as approaches worthy of further
exploration, for research or practical problem-solving.
Comments from previous attendees:
- "[Dr. Elder] provided examples shedding light on complex concepts. He gave the big picture all along the way."
- "Gave real practical insights from a practitioner's point of view."
- "Finally someone told me how things are done, not just how great Data Mining is."
- "Most valuable, were the insights into the essence of various methods, their relative strengths and weaknesses, and the important open research areas."
- "Very interesting, knowledgeable, and entertaining approach."