Abstract
This one-day course, intended both for those who wish to learn what
data mining is all about and for those who have experience using data
mining techniques, will show how to use data mining techniques
effectively in practice. It will be a practical, hands-on introduction
to the methods, based on case studies from my experience consulting in
science and industry. Although we will explain the mathematics behind
the methodologies used, we will focus on the practical issues of getting
results out of data mining. Some knowledge of statistical modeling,
especially regression techniques, will be useful.
Outline of the Day
OVERVIEW OF DATA MINING
- Why do data mining?
- Where is data mining used?
- What's hard/different about data mining?
- Types of models: predictive (classification, regression, time
series); descriptive (clustering, association detection, sequence
detection)
- The data mining process
STARTING THE PROCESS
- Stating the business problem
- Description of the data sets
- Exploring the data with trees
- Graphics for large data sets
INTRODUCTION TO MODELING
- Commonly used algorithms: classical regression (linear and
non-linear), logistic regression, decision trees, neural
networks, K-nearest neighbor methods, clustering, bagging
(Random Forests), boosting and Naive Bayes methods
- The cycle of model building
- Need for validation; methods of validation
BUILDING THE MODEL
- Choosing appropriate algorithms: matching algorithms to the
business problem; matching algorithms to the data; no best
algorithm
- Selecting data: columns (reducing dimensionality); rows
(sampling)
- Transforming the data: data representation (scaling,
binning, encoding)
- Creating new attributes
MODEL EVALUATION
- Confusion matrices
- Lift and ROI curves
WHAT CAN GO WRONG
- Overfitting
- Performance
- Interpretation
- Model limitations
Biography
Dr. Dick De Veaux holds degrees in Civil Engineering (B.S.E. Princeton),
Mathematics (A.B. Princeton), Physical Education (M.A. Stanford;
Specialization in Dance) and Statistics (Ph.D., Stanford). He has taught
at the Wharton School, the Princeton University School of Engineering,
and, since 1994, has been a professor in the Math and Stat Department of
Williams College. He has won numerous teaching awards including a
"Lifetime Award for Dedication and Excellence in Teaching" from the
Engineering Council at Princeton. He returned to Princeton in 2006 as
the William R. Kenan Jr. Visiting Professor for Distinguished Teaching.
He has won both the Wilcoxon and Shewell awards (twice) from the
American Society for Quality and is a fellow of the ASA and an elected
member of the International Statistical Institute (ISI). He has served as
General Methodology Chair for the JSM Program Committee three times, in
1987, 1995 and 1999. He served as program chair for SPES in 1996. He was
the Program Chair for the 2001 JSM in Atlanta. He currently serves on
the Board of Directors of the ASA as a Council of Sections
representative. In 2008 he was named the Mosteller Statistician of the
Year by the Boston Chapter of the ASA.
Dick has been a consultant
for nearly 30 years for such companies as American Express,
Hewlett-Packard, Alcoa, First USA Bank, DuPont, Pillsbury, Rohm and
Haas, Ernst and Young, General Electric, and Chemical Bank. He holds two
U.S. patents and is the author of over 30 refereed journal articles. His
hobbies include cycling, swimming, singing (barbershop, doo wop and
classical -- he is the head of the Diminished Faculty, a local doo wop
group and is a frequent soloist in local choirs) and dancing (he was
once a professional dancer and has a master's degree in dance education).
He is the father of four (ages 24, 22, 20 and 18). He is the co-author,
with Paul Velleman and David Bock, of the critically acclaimed textbooks
"Intro Stats", "Stats: Modeling the World" and "Stats: Data and Models",
all published by Addison-Wesley (Pearson). He is also the co-author, with
Norean Sharpe and Paul Velleman, of the books "Business Statistics" and
"Business Statistics: A First Course", published by Pearson.
Schedule
08:00-08:45 | Registration & Breakfast
08:45-09:00 | Welcoming Remarks (Wendy Lou, Xin Gao, Leonid Khinkis, Ruth Croxford); Introduction (Georges Monette)
09:00-10:00 | OVERVIEW OF DATA MINING (Dick De Veaux)
10:00-10:45 | STARTING THE PROCESS (Dick De Veaux)
10:45-11:00 | Coffee Break
11:00-12:30 | INTRODUCTION TO MODELING (Dick De Veaux)
12:30-13:30 | Networking Lunch; Career-Advice Panel (Ruth Croxford, Tim Gravelle, Janet McDougall, Tony Panzarella, Eric Cai)
13:00-14:00 | Poster Presentations
14:00-15:15 | BUILDING THE MODEL (Dick De Veaux)
15:15-15:30 | Coffee Break
15:30-16:30 | MODEL EVALUATION (Dick De Veaux)
16:30-17:00 | WHAT CAN GO WRONG (Dick De Veaux)
17:00-17:15 | Awards Presentation & Closing Remarks (Wendy Lou, Paul Corey, Georges Monette)
17:15-18:00 | SORA Annual General Meeting (Xin Gao, Hugh McCague)
18:00-20:00 | Networking Dinner (optional, pay for your own) at University of Toronto, Faculty Club