Data Mining

Instructor: Jeff Phillips (email) | Office hours: Thursday mornings 10-11am @ MEB 3442 (and directly after class, walking from WEB L104 to MEB 3442)
TAs: Benwei Shi (email) | Office hours: TBA
      + Mingxuan Han (email) | Office Hours: TBA
      + TBA (email) | Office Hours: TBA
      + TBA (email) | Office Hours: TBA
Spring 2020 | Mondays, Wednesdays 3:00 pm - 4:20 pm
WEB L104
Catalog number: CS 5140 01 or CS 6140 01

Data mining is the study of efficiently finding structures and patterns in large data sets. We will focus on several aspects of this: (1) converting from a messy and noisy raw data set to a structured and abstract one, (2) applying scalable and probabilistic algorithms to these well-structured abstract data sets, and (3) formally modeling and understanding the error and other consequences of parts (1) and (2), including choice of data representation and trade-offs between accuracy and scalability. These steps are essential for training as a data scientist.
Algorithms, programming, probability, and linear algebra are required tools for understanding these approaches.
Topics will include: similarity search, clustering, regression/dimensionality reduction, graph analysis, PageRank, and small space summaries. We will also cover several recent developments, and the application of these topics to modern applications, often relating to large internet-based companies.
Upon completion, students should be able to read, understand, and implement ideas from many data mining research papers.

Learning Objectives: On completion of this course students will be able to:
  • convert a structured data set (like text) into an abstract data representation such as a vector, a set, or a matrix, with modeling considerations, for use in downstream data analysis
  • implement and analyze touchstone data mining algorithms for clustering, dimensionality reduction, regularized regression, graph analysis, and locality sensitive hashing.
  • understand, discuss, and evaluate advanced data mining algorithms for clustering, dimensionality reduction, regularized regression, graph analysis, locality sensitive hashing, and managing noisy data.
  • work with a team to design and execute a multi-faceted data mining project on data that is not already structured for the analysis task, and to compare and evaluate the design choices.
  • present progress and final results on a data analysis project using written, oral, and visual media: to peers in small groups, to peers in a large interactive environment, and to obtain approval from a superior.
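As a small taste of the first two objectives, here is a sketch (in Python, one common language choice for this course) of turning raw text into an abstract set representation via character k-grams and comparing two documents with the Jaccard similarity, the first similarity topics on the schedule. The function names are illustrative, not from any course-provided code.

```python
def k_grams(text, k=3):
    """Return the set of all length-k character substrings of text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard(A, B):
    """Jaccard similarity |A intersect B| / |A union B| of two sets."""
    if not A and not B:
        return 1.0
    return len(A & B) / len(A | B)

# Two short "documents" represented as sets of 3-grams.
d1 = k_grams("data mining is fun")
d2 = k_grams("data mining is cool")
print(round(jaccard(d1, d2), 3))  # prints 0.65
```

Note how the modeling choice (the value of k, characters vs. words) changes the representation, and hence the similarity score; such trade-offs are a recurring theme in the course.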

    The book for this course will mostly be a nearly-complete book on the Mathematical Foundation for Data Analysis (M4D), version v0.6. However, the lectures will follow more closely my related Data Mining course notes, and in several cases, these have not made it into the above book (yet?).
    We will also often link to two other online resources that cover similar material, either with a more applied or theoretical focus:
    MMDS(v1.3): Mining Massive Data Sets by Anand Rajaraman, Jure Leskovec, and Jeff Ullman. The digital version of the book is free, but you may wish to purchase a hard copy.
    FoDS: Foundations of Data Science by Avrim Blum, John Hopcroft and Ravindran Kannan. This provides some proofs and formalisms not explicitly covered in lecture.

    Videos: We plan to videotape all lectures and make them available online. They will appear on this playlist on our YouTube Channel.
    Videos will also livestream here.

    Prerequisites: A student who is comfortable with basic probability, basic linear algebra, basic big-O analysis, and basic programming and data structures should be qualified for the class. A great primer on these can be found in the class text Mathematical Foundation for Data Analysis.
    There is no specific language we will use. Python is often a good choice, although some parts may be simpler in Matlab/Octave. However, programming assignments will often (intentionally) not be as specific as in lower-level classes. This partially simulates real-world settings, where one is given a data set and asked to analyze it; in such settings even less direction is provided.
    For undergrads, the formal prerequisites are CS 3500 and CS 3190 (which has CS 3130 and Math 2270, or equivalent, as pre/co-reqs).
    For graduate students, there are no enforced prerequisites. Still, it may be useful to review early material in Mathematical Foundation for Data Analysis (e.g., Chapters 1 and 3, and the first parts of 2, 5, and 7).
    In the past, this class has had undergraduates, masters, and PhD students, including many from outside of Computer Science. Most (but not all) have kept up fine, and still most have been challenged. If you are unsure if the class is right for you, contact the instructor.

    For an example of what sort of mathematical material I expect you to be intimately familiar with, see Chapters 1 and 3 in Mathematical Foundation for Data Analysis. Other relevant material from CS 3190 will be reviewed, but very rapidly.
    Schedule: (subject to change)
    Date | Topic (+ Notes) | Video | Links | Assignment (latex) | Project
    Mon 1.06 Class Overview vid
    Wed 1.08 Statistics Principles (S) vid M4D 2.2-2.3 | MMDS 1.2 | FoDS 12.4
    Mon 1.13 Similarity : Jaccard + k-Grams (S) vid M4D 4.3-4.4 | MMDS 3.1 + 3.2 | FoDS 7.3
    Wed 1.15 Similarity : Min Hashing (S) vid M4D 4.6.6 | MMDS 3.3 Statistical Principles
    Mon 1.20 No class (Martin Luther King Jr. Day)
    Wed 1.22 Similarity : LSH (S) vid M4D 4.6 | MMDS 3.4 Proposal
    Mon 1.27 Similarity : Distances (S) vid M4D 4 - 4.3 | MMDS 3.5 + 7.1 | FoDS 8.1
    Wed 1.29 Similarity : Word Embed + ANN vs. LSH (S) vid M4D 4.4 | [Ethics Read] | MMDS 3.7 + 7.1.3 Document Hash
    Mon 2.03 Clustering : Hierarchical (S) vid M4D 8.5, 8.2 | MMDS 7.2 | FoDS 7.7
    Wed 2.05 Clustering : K-Means (S) vid M4D 8-8.3 | MMDS 7.3 | FoDS 7.2-3 LSH
    Mon 2.10 Clustering : Spectral (S) vid M4D 10.3 | MMDS 10.4 | FoDS 7.5
    Wed 2.12 Streaming : Model and Misra-Gries (S) vid M4D 11.1 - 11.2.2 | FoDS 6.2.3 | MMDS 6+4.3 | BF Analysis Data Collection Report
    Mon 2.17 No class (Presidents Day)
    Wed 2.19 Streaming : Count-Min Sketch, Count Sketch, and Apriori (S) vid M4D 11.2.3-4 | FoDS 6.2.3 | MMDS 6+4.3 | BF Analysis Clustering
    Mon 2.24 Regression : Basics in 2-dimensions (S) vid M4D 5-5.3 | ESL 3.2 and 3.4
    Wed 2.26 Regression : Lasso + MP + Comp. Sensing (S) vid M4D 5.5 | FoDS 10.2 | Tropp + Gilbert Frequent
    Mon 3.02 Regression : Cross-Validation and p-values (S) vid [Ethics Read] | M4D 5.5 | ESL 3.8
    Wed 3.04
    Mon 3.09 No class (Spring Break)
    Wed 3.11 No class (Spring Break)
    Mon 3.16 Dim Reduce : SVD + PCA (S) vid M4D 7-7.3, 7.5 | FoDS 4 Intermediate Report
    Wed 3.18 Dim Reduce : more PCA, and Random Projections (S) vid M4D 7.10 | FoDS 2.9
    Mon 3.23 Dim Reduce : Matrix Sketching (S) vid M4D 11.3 | MMDS 9.4 | FoDS 2.7 + 7.2.2 | arXiv
    Wed 3.25 Dim Reduce : Metric Learning (S) vid M4D 7.6-7.8 | LDA Regression
    Mon 3.30 Noise : Noise in Data (S) vid M4D 8.6 | MMDS 9.1 | Tutorial
    Wed 4.01 Noise : Privacy (S) vid McSherry | Dwork Dim Reduce
    Mon 4.06 Graph Analysis : Markov Chains (S) vid M4D 10.1 | MMDS 10.1 + 5.1 | FoDS 5 | Weckesser
    Wed 4.08 Graph Analysis : PageRank (S) vid M4D 10.2 | MMDS 5.1 + 5.4
    Mon 4.13 Graph Analysis : MapReduce (S) vid MMDS 2 | Final Report
    Wed 4.15 Graph Analysis : Communities (S) vid M4D 10.4 | MMDS 10.2 + 5.5 | FoDS 8.8 + 3.4 Poster Outline
    Mon 4.20
    Wed 4.22 Graphs
    Fri 4.24 Poster Day !!! (3:30-5:30pm) Poster Presentation

    This course follows the SoC Guidelines.

    LaTeX: I recommend using LaTeX for writing up homework. It is something that everyone should know for research and writing scientific documents. This linked directory contains a sample .tex file, as well as what its compiled .pdf outcome looks like. It also has a figure .pdf to show how to include figures. Overleaf provides a simple free web interface.
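If you have not used LaTeX before, a minimal homework file looks something like the sketch below. The document class and packages here are just common defaults, not necessarily what the linked sample .tex file uses.

```latex
\documentclass{article}
\usepackage{amsmath, graphicx}

\title{CS 5140/6140 -- Assignment 1}
\author{Your Name}

\begin{document}
\maketitle

\section*{Problem 1}
The Jaccard similarity of two sets $A$ and $B$ is
\[
  \mathrm{J}(A,B) = \frac{|A \cap B|}{|A \cup B|}.
\]

% A figure is included like this (see the sample directory):
% \includegraphics[width=0.5\textwidth]{figure.pdf}

\end{document}
```

Compiling this with pdflatex (or pasting it into Overleaf) produces a titled .pdf with one typeset equation.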