Data Mining

Data Mining
Instructor : Jeff Phillips (email) | Office hours: Thursdays 9-10am @ MEB 3404 (and directly after class)
TAs: TBA (email) | Office hours: TBA
Fall 2025 | Mondays, Wednesdays 1:25 pm - 2:45 pm
Lectures will take place in S BEH AUD and probably streamed on YouTube, where they will also be archived.
Catalog number: CS 5140 01 or CS 6140 01 or DS 4140 01

Syllabus

Description:
Data mining is the study of efficiently finding structures and patterns in large data sets. We will focus on un-supervised aspects of this focusing on fundamental principals which serve as building blocks for much of data analysis (including ML and AI): (1) converting from a messy and noisy raw data set to a structured and abstract one, (2) applying scalable and probabilistic algorithms to these well-structured abstract data sets, and (3) formally modeling and understanding the error and other consequences of parts (1) and (2), including choice of data representation and trade-offs between accuracy and scalability. These steps are essential for training as a data scientist.
Algorithms, programming, probability, and linear algebra are required tools for understanding these approaches.
Topics will include: similarity search, choices of distance measures, clustering, dimensionality reduction, graph analysis, and small space summaries. We will also cover several recent developments, and the application of these topics to modern applications, often relating to large internet-based companies.
Upon completion, students should be able to read, understand, and implement ideas from many data mining research papers. As the field is ever-evolving, this will be an important skill to stay relevant in the field.

Learning Objectives: On completion of this course students will be able to:

convert a structured data set (like text) into an abstract data representation such as a vector, a set, or a matrix, with modeling considerations, for use in downstream data analysis

implement and analyze touchstone data mining algorithms for clustering, dimensionality reduction, graph analysis, and fast similarity search.

understand, discuss, and evaluate advanced data mining algorithms for clustering, dimensionality reduction, graph analysis, fast similarity search, and managing noisy data.

work with a team to design and execute a multi-faceted data mining project on data which is not already structured for the analysis task, and to compare and evaluate the design choices.

present progress and final results using written, oral, and visual media on a data analysis project to peers in small groups, to peers in large interactive environment, and to get approval from a superior.

Books:
The book for this course will mostly be from a book on the Mathematical Foundations for Data Analysis (M4D).
We will also often link to two other online resources that cover similar material, either with a more applied or theoretical focus:
MMDS(v1.3): Mining Massive Data Sets by Anand Rajaraman, Jure Leskovec, and Jeff Ullman. The digital version of the book is free, but you may wish to purchase a hard copy.
FoDS: Foundations of Data Science by Avrim Blum, John Hopcroft and Ravindran Kannan. This provide some proofs and formalisms not explicitly covered in lecture.
However the course will attempt to venture into discussions on applications beyond the more technical material covered in those books.

Videos: We (probably) plan to record all lectures, and make them available online. Then they will appear on this playlist on our YouTube Channel.
Videos should also livestream here.

Prerequisits: A student who is comfortable with basic probability, basic linear algebra, basic big-O analysis, and basic programming and data structures should be qualified for the class. A great primer on these can be found in the class text Foundations of Data Analysis.
Python will be the default supported programming language for the course. Students may choose to complete some aspects in other languages, but we do not plan to support help in those cases.
For undergrads, the formal prerequisites are CS 3500 and CS 3190 (which itself has CS 3130 and Math 2270, or equivalant as pre/co-regs).
For graduate students, there are no enforced pre-requisites. Still it may be useful to review early material in Mathematical Foundations for Data Analysis (e.g., Chapters 1,3 and first parts of 2,5,7).
In the past, this class has had undergraduates, masters, and PhD students, including many from outside of Computer Science. Most (but not all) have kept up fine, and still most have been challenged. If you are unsure if the class is right for you, contact the instructor.

For an example of what sort of mathematical material I expect you to be intimately familiar with, see chapters 1 and 3 in Mathematical Foundations for Data Analysis. Other relevant material from CS 3190 will be reviwed, but very rapidly.

Schedule: (subject to change)

Date	Topic (+ Notes)	Video	Link	Assignment (latex)	Project
Mon 8.18	Class Overview	-
Wed 8.20	Statistics Phenomena (S)	-	M4D 2.2-2.3 \| MMDS 1.2 \| FoDS 12.4
Mon 8.25	Similarity : Language Embeddings (S)	-	M4D 4.4
Wed 8.27	Similarity : Metric Distances (S)	-	M4D 4 - 4.3 \| MMDS 3.5 + 7.1 \| FoDS 8.1	Statistical Phenomenon
Mon 9.01	Labor DAY
Wed 9.03	Similarity : ANN: HNSW GraphSearch (S)	-	M4D 4.4 \| MMDS 3.7 + 7.1.3
Mon 9.08	Similarity : Jaccard + Distribution Distances (S)	-	M4D 4.4
Wed 9.10	Similarity : K-grams -> LSH (S)	-	M4D 4.3-4.4 + 4.6 \| MMDS 3.1 + 3.2 + 3.3 + 3.4\| FoDS 7.3		Proposal
Mon 9.15	Clustering : Hierarchical (S)	-	M4D 8.5, 8.2 \| MMDS 7.2 \| FoDS 7.7
Wed 9.17	Clustering : K-Means (S)	-	M4D 8-8.3 \| MMDS 7.3 \| FoDS 7.2-3	Similarity
Mon 9.22	Clustering : Spectral (S)	-	M4D 10.3 \| MMDS 10.4 \| FoDS 7.5
Wed 9.24	Clustering : Choosing k	-	M4D 10.3		Data Collection Report
Mon 9.29	Streaming : Model, Sampling, and Quantiles	-	MMDS 4.3
Wed 10.01	Streaming : Misra-Greis and Count-Min (S)	-	M4D 11.1 - 11.2.2 \| FoDS 6.2.3 \| MMDS 6	Clustering
Mon 10.06	FALL BREAK
Wed 10.08	FALL BREAK
Mon 10.13	Streaming : Count Sketch, Distinct Counting, and Apriori (S)	-	M4D 11.2.3-4 \| FoDS 6.2.3 \| MMDS 4.3
Wed 10.15	Streaming : Misc	-	MMDS 4.1	Frequency Estimation
Wed 10.15	MIDTERM TEST (practice)
Mon 10.20	Dim Reduce : SVD + PCA (S)	-	M4D 7-7.3, 7.5 \| FoDS 4
Wed 10.22	Dim Reduce : Random Projections (S)	-	M4D 7.1, 7.10 \| FoDS 2.7		Intermediate Report
Mon 10.27	Dim Reduce : Matrix Sketching (S)	-	M4D 11.3 \| MMDS 9.4 \| FoDS 2.7 + 7.2.2 \| arXiv
Wed 10.29	Dim Reduce : Metric Learning (S)	-	M4D 7.6-7.8 \| LDA \| LDML
Mon 11.03	Noise : Outliers and Robust Estimation (S)	-	M4D 7.10 + 8.6 \| MMDS 9.1 \| FoDS 2.9 \| Tutorial \| robust mean
Wed 11.05	Noise : Anomaly Detection (S)	-		Dimensionality Reduction
Mon 11.10	Noise : Encoding Concepts and Bias		[Ethics Read, VERB]
Wed 11.12	Noise : (Differential) Privacy (S)	-	McSherry \| Dwork
Wed 11.19	Graph Analysis : Markov Chains (S)	-	M4D 10.1 \| MMDS 10.1 + 5.1 \| FoDS 5 \| Weckesser
Mon 11.24	Graph Analysis : PageRank (S)	-	M4D 10.2 \| MMDS 5.1 + 5.4
Wed 11.26	Graph Analysis : Graph Embeddings + Review (S)	-		Graphs + Noise	Final Report
Mon 12.01	Graph Analysis : Communities (S)	-	M4D 10.4 \| MMDS 10.2 + 5.5 \| FoDS 8.8 + 3.4
Wed 12.03	ENDTERM TEST (practice)
Thu 12.04					Poster Outline
Fri 12.12	Poster Day !!! (1:00-3:00pm)				Poster Presentation

This course follows the SoC Guidelines

Latex: I recommend using LaTex for writing up homeworks. It is something that everyone should know for research and writing scientific documents. This linked directory contains a sample .tex file, as well as what its .pdf compiled outcome looks like. It also has a figure .pdf to show how to include figures. Overleaf provides a simple free web interface.