Annual 2-Day Course

Tools for Discovering Patterns in Data:

Extracting Value from Tables, Text, and Links

 

Presenter: John Elder, Ph.D.

Charlottesville, Virginia 
September 8-9, 2014
 
Download Registration Form here,
or register online.
 

 

Course Description:

Find the useful information hidden in your data! This course surveys computer-intensive methods for inductive classification and estimation, drawn from Statistics, Machine Learning, and Data Mining. Dr. Elder will describe the key inner workings of leading algorithms, compare their merits, and (briefly) demonstrate their relative effectiveness on practical applications. We'll first review classical statistical techniques, both linear and nonparametric, then outline the ways in which these basic tools are modified and combined into powerful modern methods. The course emphasizes practical advice and focuses on the essential techniques of Resampling, Visualization, and Ensembles. Actual scientific and business examples will illustrate proven techniques employed by expert analysts. Along the way, major relative strengths and distinctive properties of the leading commercial software products for Data Mining will be discussed.

 

Instructor:

John F. Elder IV, Ph.D. heads a top data mining consulting team, based in Charlottesville, Virginia, and Washington, DC. Founded in 1995, Elder Research, Inc. (ERI) focuses on commercial, investment, and security applications of advanced analytics including stock selection, text mining, social networks, image recognition, biometrics, process optimization, drug efficacy, credit scoring, and fraud detection. John holds a BS and MEE in Electrical Engineering from Rice University, and a PhD in Systems Engineering from the University of Virginia, where he’s an Adjunct Professor teaching Optimization or Data Mining. Prior to 18 years leading ERI, he spent 5 years in aerospace consulting, 4 heading research at an investment management firm, and 2 in Rice's Computational & Applied Mathematics department.

Dr. Elder has authored innovative data mining tools, is a frequent keynote speaker, and was Chair of the 2009 Knowledge Discovery & Data Mining conference in Paris. He was honored to serve five years on a panel appointed by the President to guide technology for national security. He has co-authored award-winning books on practical data mining (2009) and ensemble modeling (2010). John is grateful to be a follower of Christ and the father of 5.

 

Intended Audience:

Those from industry and academia who work with data and wish to understand recent developments in pattern discovery, data mining, and inductive modeling. At the conclusion of this course, one should be able to discern the basic strengths of competing methods and select the appropriate tools for one's applications. Participants should have prior working experience with computers and an interest in applied statistical techniques. (It helps, as well, to have a motivating application you wish to solve.)

 

Course Outline

I. Pattern Discovery: An Overview
  • Inducing Models from Data: Benefits and Dangers
  • Example Projects from Science and Business
  • Characteristics of successful projects
  • Leading Software Tools and Vendors
II. Classical Statistical Techniques (brief review)
  • Regression
  • Discriminant Analysis & Principal Components
  • Nearest Neighbors & Kernels
III. Modern Methods
  • Neural & Polynomial Networks
  • Decision Trees & MARS (Regression Splines)
IV. Key General Tools
  • Scientific Visualization: Grand Tour, Projection Pursuit, limitations
  • Bootstrapping/Resampling: Essential!
  • Bayes' Rule
  • Optimization: local and global
  • Overfit Control: Complexity Penalty, Smoothing, Shrinking, Generalized Degrees of Freedom
V. Data Trouble-Shooting
  • Case Diagnostics (Outlying, Influential, Leverage, & Missing points)
  • Feature Creation and Selection
VI. Text Mining
  • Stemming, Collocation, & Association Networks
  • Statistical vs. Language-dependent methods
  • “Bag of Words” & Vector Space
  • Focused Crawling & Active Learning
VII. Social Network Analysis
  • The power of the "network effect"
  • Visualization & modeling tools and examples
VIII. Comparing and Combining Algorithms
  • Adaptive model structure
  • Matching an algorithm to your application
  • Experimental test results
  • Combining models to improve accuracy
  • Bayesian Model Averaging
  • Bagging & Boosting
  • Why Ensembles work
IX. Top 10 Data Mining Mistakes
  • Lack data
  • Focus on Training
  • Rely on 1 technique
  • Ask the wrong question
  • Listen (only) to the data
  • Future leakage
  • Discount pesky cases
  • Extrapolate
  • Answer every inquiry
  • Sample without care
  • Believe the best model



A note about the course scope:

Each of the major topics discussed could comprise a semester-long course if presented in full detail! What this (intensive) short course provides is a broad overview of the highlights, drawing connections between major developments in the diverse fields that contribute to Predictive Analytics, including cutting-edge ways to mine text and graphical networks. Previous participants have found this "big picture" to be very useful for identifying techniques to use immediately, as well as approaches worthy of further exploration, for research or practical problem-solving.

 

Comments from previous attendees:

  • "[Dr. Elder] provided examples shedding light on complex concepts. He gave the big picture all along the way."
  • "Gave real practical insights from a practitioner's point of view."
  • "Finally someone told me how things are done, not just how great Data Mining is."
  • "Most valuable, were the insights into the essence of various methods, their relative strengths and weaknesses, and the important open research areas."
  • "Very interesting, knowledgeable, and entertaining approach."