 
Successful Data Mining in Practice
 

Abstract

This one-day course, intended both for those who wish to learn what data mining is all about and for those who have experience using data mining techniques, will show how to use data mining techniques effectively in practice. It will be a practical, hands-on introduction to the methods, based on case studies from my experience consulting in science and industry. Although we will explain the mathematics behind the methodologies used, we will focus on the practical issues of getting results out of data mining. Some knowledge of statistical modeling, especially regression techniques, will be useful.

Outline of the Day

OVERVIEW OF DATA MINING

  • Why do data mining?
  • Where is data mining used?
  • What's hard/different about data mining?
  • Types of models: predictive (classification, regression, time series); descriptive (clustering, association detection, sequence detection)
  • The data mining process

STARTING THE PROCESS

  • Stating the business problem
  • Description of the data sets
  • Exploring the data with trees
  • Graphics for large data sets

INTRODUCTION TO MODELING

  • Commonly used algorithms: classical regression (linear and non-linear), logistic regression, decision trees, neural networks, K-nearest neighbor methods, clustering, bagging (Random Forests), boosting and Naive Bayes methods
  • The cycle of model building
  • Need for validation - methods of validation

BUILDING THE MODEL

  • Choosing appropriate algorithms: matching algorithms to the business problem; matching algorithms to the data; no best algorithm
  • Selecting data: columns (reducing dimensionality); rows (sampling)
  • Transforming the data: data representation (scaling, binning, encoding)
  • Creating new attributes

MODEL EVALUATION

  • Confusion matrices
  • Lift and ROI curves

WHAT CAN GO WRONG

  • Overfitting
  • Performance
  • Interpretation
  • Model limitations


Biography


Dr. Dick De Veaux holds degrees in Civil Engineering (B.S.E., Princeton), Mathematics (A.B., Princeton), Physical Education (M.A., Stanford; Specialization in Dance) and Statistics (Ph.D., Stanford). He has taught at the Wharton School and the Princeton University School of Engineering and, since 1994, has been a professor in the Math and Stat Department of Williams College. He has won numerous teaching awards, including a "Lifetime Award for Dedication and Excellence in Teaching" from the Engineering Council at Princeton. He returned to Princeton in 2006 as the William R. Kenan Jr. Visiting Professor for Distinguished Teaching. He has won both the Wilcoxon and Shewell awards (twice) from the American Society for Quality and is a fellow of the ASA and an elected member of the International Statistical Institute (ISI). He has served as General Methodology Chair for the JSM Program Committee three times, in 1987, 1995 and 1999. He served as program chair for SPES in 1996. He was the Program Chair for the 2001 JSM in Atlanta. He currently serves on the Board of Directors of the ASA as a Council of Sections representative. In 2008 he was named the Mosteller Statistician of the Year by the Boston Chapter of the ASA.

Dick has been a consultant for nearly 30 years for such companies as American Express, Hewlett-Packard, Alcoa, First USA Bank, DuPont, Pillsbury, Rohm and Haas, Ernst and Young, General Electric, and Chemical Bank. He holds two U.S. patents and is the author of over 30 refereed journal articles. His hobbies include cycling, swimming, singing (barbershop, doo wop and classical; he is the head of the Diminished Faculty, a local doo wop group, and is a frequent soloist in local choirs) and dancing (he was once a professional dancer and has a master's degree in dance education). He is the father of four (ages 24, 22, 20 and 18). He is the co-author, with Paul Velleman and David Bock, of the critically acclaimed textbooks "Intro Stats", "Stats: Modeling the World" and "Stats: Data and Models", all published by Addison-Wesley (Pearson). He is the co-author, with Norean Sharpe and Paul Velleman, of the books "Business Statistics" and "Business Statistics: A First Course", published by Pearson.



Schedule

08:00-08:45 Registration & Breakfast
08:45-09:00 Welcoming Remarks
Wendy Lou, Xin Gao, Leonid Khinkis, Ruth Croxford

Introduction
Georges Monette, Marianne Messina
09:00-10:00 OVERVIEW OF DATA MINING
Dick De Veaux
10:00-10:45 STARTING THE PROCESS
Dick De Veaux
10:45-11:00 Coffee Break
11:00-12:30 INTRODUCTION TO MODELING
Dick De Veaux
12:30-13:30 Networking Lunch

Career-Advice Panel
Ruth Croxford, Tim Gravelle, Janet McDougall, Tony Panzarella, Eric Cai
13:00-14:00 Poster Presentations
14:00-15:15 BUILDING THE MODEL
Dick De Veaux
15:15-15:30 Coffee Break
15:30-16:30 MODEL EVALUATION
Dick De Veaux
16:30-17:00 WHAT CAN GO WRONG
Dick De Veaux
   
17:00-17:15 Awards Presentation & Closing Remarks
Wendy Lou, Paul Corey, Georges Monette
   
17:15-18:00 SORA Annual General Meeting
Xin Gao, Hugh McCague
18:00-20:00 Networking Dinner (optional, pay for your own)
at University of Toronto, Faculty Club