Data Mining
Instructor : Jeff Phillips (email) | Office hours: Thursday morning 10-11am @ MEB 3442 (and directly after class in WEB L104)
TAs: Sunipa Dev (email) | Office hours: Monday 11am-1pm, MEB 3115
      + Maryam Baryouti (email) | Office Hours: Wednesdays noon-1pm; Thursdays 2-3pm, MEB 3115
      + Yang Gao (email) | Office Hours: Tuesday 8-9pm, (online, TBD)
      + Trang Tran (email) | Office Hours: Monday 4:30-5:30pm; Thursdays 4-5pm, MEB 3115
      + Sanjeev Sandeep (email) | Office Hours: Tuesday 10:30am-12:30pm, MEB 3115
Spring 2018 | Mondays, Wednesdays 3:00 pm - 4:20 pm
WEB L104
Catalog number: CS 5140 01 or CS 6140 01

Data mining is the study of efficiently finding structures and patterns in large data sets. We will focus on several aspects of this: (1) converting from a messy and noisy raw data set to a structured and abstract one, (2) applying scalable and probabilistic algorithms to these well-structured abstract data sets, and (3) formally modeling and understanding the error and other consequences of parts (1) and (2), including choice of data representation and trade-offs between accuracy and scalability. These steps are essential for training as a data scientist.
Algorithms, programming, probability, and linear algebra are required tools for understanding these approaches.
Topics will include: similarity search, clustering, regression/dimensionality reduction, graph analysis, PageRank, and small space summaries. We will also cover several recent developments, and the application of these topics to modern applications, often relating to large internet-based companies.
Upon completion, students should be able to read, understand, and implement ideas from many data mining research papers.

MY OWN COURSE NOTES will serve as the de facto "book" for this course. However, the following free online books may serve as useful references that have good overlap with the course.
MMDS(v1.3): Mining Massive Data Sets by Anand Rajaraman, Jure Leskovec, and Jeff Ullman. The digital version of the book is free, but you may wish to purchase a hard copy.
FoDS: Foundations of Data Science by Avrim Blum, John Hopcroft and Ravindran Kannan. This provides some proofs and formalisms not explicitly covered in lecture.
M4DA: Math for Data Analysis by Jeff M. Phillips. This is a gradual introduction to many of the topics this course builds on.

Videos: We plan to videotape all lectures, and make them available online. They will appear on this playlist on our YouTube Channel.
Videos will also livestream here.

Prerequisites: A student who is comfortable with basic probability, basic linear algebra, basic big-O analysis, and basic programming and data structures should be qualified for the class. A great primer on the Mathematics of Data Analysis can be found in the linked book.
There is no specific language we will use. However, programming assignments will often (intentionally) not be as specific as in lower-level classes. This will partially simulate real-world settings where one is given a data set and asked to analyze it; in such settings even less direction is provided.
For undergrads, the formal prerequisites are CS 3500, CS 3130, and MATH 2270 (or equivalent), and CS 4150 is a corequisite. We recommend undergraduates take the new course CS 4964 (Foundations of Data Analysis) before this course, but it is not currently required, and many students have done well without having taken it. I will grant exceptions to the prerequisites for students with (a reasonable grade in) Foundations of Data Analysis.
For graduate students, there are no enforced prerequisites. Still, it may be useful to review material in the Math for Data book.
In the past, this class has had undergraduates, masters, and PhD students, including many from outside of Computer Science. Most (but not all) have kept up fine, and still most have been challenged. If you are unsure if the class is right for you, contact the instructor.

For an example of the sort of mathematical material I expect you to be familiar with, see these notes on probability and linear algebra.
Schedule: (subject to change)
Date Topic (+ Notes) Video Link Assignment (latex) Project
Mon 1.08 Class Overview VID MMDS 1.1
Wed 1.10 Statistics Principles VID M4DA 3 | MMDS 1.2 | FoDS 12.4
Mon 1.15
Wed 1.17 Similarity : Jaccard + k-Grams (S) VID MMDS 3.1 + 3.2 | FoDS 7.3
Mon 1.22 Similarity : Min Hashing (S) VID MMDS 3.3
Wed 1.24 Similarity : LSH (S) VID MMDS 3.4 Statistical Principles
Mon 1.29 Similarity : Distances (S) VID MMDS 3.5 + 7.1 | FoDS 8.1
Wed 1.31 Similarity : Word Embed + ANN vs. LSH (S) VID [Ethics Read] | MMDS 3.7 + 7.1.3 Proposal
Mon 2.05 Clustering : Hierarchical (S) VID MMDS 7.2 | FoDS 7.7
Wed 2.07 Clustering : K-Means (S) VID M4DA 8 | MMDS 7.3 | FoDS 7.2-3
Mon 2.12 Clustering : Spectral (S) VID MMDS 10.4 | FoDS 7.5
Wed 2.14 Streaming : Misra-Gries and Frugal (S) VID MMDS 4.1 | FoDS 6.2.3 Document Hash
Mon 2.19
Wed 2.21 Streaming : Count-Min + Apriori Algorithm (S) VID MMDS 6+4.3 | BF Analysis Data Collection Report
Mon 2.26 Regression : Basics in 2-dimensions (S) VID M4DA 5 | ESL 3.2 and 3.4
Wed 2.28 Regression : SVD + PCA (S) VID M4DA 4 and 7 | FoDS 4 Clustering
Mon 3.05 Regression : Metric Learning (S) VID M4DA 7 | LDA
Wed 3.07
Mon 3.12 Regression : Comp. Sensing and OMP (S) VID FoDS 10.2 | Tropp + Gilbert
Wed 3.14 Regression : L1 Regression and Lasso (S) VID [Ethics Read] | ESL 3.8 Frequent
Mon 3.19
Wed 3.21
Mon 3.26 Regression : Random Projections (S) VID FoDS 2.9 Intermediate Report
Wed 3.28 Regression : Matrix Sketching (S) VID MMDS 9.4 | FoDS 2.7 + 7.2.2 | arXiv
Mon 4.02 Noise : Noise in Data (S) VID MMDS 9.1 | Tutorial
Wed 4.04 Noise : Privacy (S) VID McSherry | Dwork Regression
Mon 4.09 Graph Analysis : Markov Chains (S) VID MMDS 10.1 + 5.1 | FoDS 5 | Weckesser Regression
Wed 4.11 Graph Analysis : PageRank (S) VID MMDS 5.1 + 5.4
Mon 4.16 Graph Analysis : MapReduce (S) VID MMDS 2 | Final Report
Wed 4.18 Graph Analysis : Communities (S) VID MMDS 10.2 + 5.5 | FoDS 8.8 + 3.4 Poster Outline
Mon 4.23
Mon 4.30 Graphs
Wed 5.02 Poster Day !!! (3:30-5:30pm) Poster Presentation

This course follows the SoC Guidelines

LaTeX: I highly recommend using LaTeX for writing up homeworks. It is something that everyone should know for research and writing scientific documents. This linked directory contains a sample .tex file, as well as what its compiled .pdf output looks like. It also has a figure .pdf to show how to include figures.