Data Mining
Instructor : Jeff Phillips (email) | Office hours: Thursday 1:30pm-2:30pm @ MEB 3442 (and often directly after class)
TAs: Reza Esfadani (email) | Office hours: Tuesday 11am-noon @ 3115 MEB
+ Raghvendra Singh (email) | Office Hours: Monday 3-4pm @ 3423 MEB (near grad lounge)
Spring 2014 | Mondays, Wednesdays 5:15 pm - 6:35 pm
WEB 2230
Catalog number: CS 5140 01 or CS 6140 01

Data mining is the study of efficiently finding structures and patterns in data sets. We will also study what structures and patterns you cannot find. These structures and patterns are based on statistical and probabilistic principles, and they are found efficiently through the use of clever algorithms. This class will take this two-pronged approach to the topic: we will understand the models and then explore efficient algorithms to find them.
This class may differ greatly from many data mining classes offered elsewhere. Perhaps it should be called "Large Scale Data Mining" since many of the techniques we will discuss have been designed to deal with (or have survived the onslaught of) very large scale data. Many of these techniques use randomized algorithms - these are often extremely simple to use, but more difficult to analyze. We will focus more on how to use them, and give explanations (but often not proofs) of correctness.
Topics will include: similarity search, clustering, regression/dimensionality reduction, link analysis (PageRank), and small space summaries. We may also discuss anomaly detection, compressed sensing, and pattern matching.

MMDS(v1.3): Mining of Massive Datasets by Anand Rajaraman, Jure Leskovec, and Jeff Ullman. The digital version of the book is free, but you may wish to purchase a hard copy.
CSTIA: Computer Science Theory for the Information Age by John Hopcroft and Ravi Kannan. These are currently only collated lecture notes from a theory class that covers some similar topics. They provide some proofs and formalisms not explicitly covered in lecture.
When material is not covered by the books, free reference material will be linked to or produced.

Videos: We plan to videotape all lectures, and make them available online. We plan to place them on a class YouTube Channel.

Prerequisites: A student who is comfortable with basic probability, basic big-O analysis, and simple programming will be qualified for the class. There is no specific language we will use. However, programming assignments will often (intentionally) not be as specific as in lower-level classes. This will partially simulate real-world settings where one is given a data set and asked to analyze it; in such settings even less direction is provided.
For undergrads, the prerequisites are CS 3505 and CS 2100. It is also highly recommended you have taken CS 3130 - in many ways, this is the natural continuation of that course.
In the past, this class has had undergraduates, masters, and PhD students, including many from outside of Computer Science. Most have kept up fine, and still most have been challenged. If you are unsure if the class is right for you, contact the instructor.
Schedule: (subject to change - some linked material is from the previous iteration of the class)
| Date | Topic | Video | Link | Assignment (latex) | Project |
| Mon 1.06 | (Instructor Traveling - No Class) | | | | |
| Wed 1.08 | Class Overview | 1,2 | MMDS 1.1 | | |
| Mon 1.13 | Statistics Principles : Birthday Paradox + Coupon Collector | 1,2 | MMDS 1.2 | | |
| Wed 1.15 | Chernoff-Hoeffding Bounds + Applications | 1,2,3 | CSTIA 2.3; Terry Tao Notes; Tarjan Notes | | |
| Mon 1.20 | (MLK Day - No Class) | | | | |
| Wed 1.22 | Similarity : Jaccard + k-Grams | 1,2 | MMDS 3.1 + 3.2; CSTIA 7.3 | | |
| Mon 1.27 | Similarity : Min Hashing | 1,2,3 | MMDS 3.3 | | |
| Wed 1.29 | Similarity : LSH | 1,2,3 | MMDS 3.4 | Statistical Principles | |
| Mon 2.03 | Similarity : Distances | 1,2,3 | MMDS 3.5 + 7.1; CSTIA 8.1 | | Proposal |
| Wed 2.05 | Similarity : SIFT and ANN vs. LSH | 1,2,3 | MMDS 3.7 + 7.1.3 | | |
| Mon 2.10 | Clustering : Hierarchical | 1,2,3 | MMDS 7.2; CSTIA 8.7 | | |
| Wed 2.12 | Clustering : K-Means | 1,2,3 | MMDS 7.3; CSTIA 8.3 | | |
| Mon 2.17 | (Presidents Day - No Class) | | | | |
| Wed 2.19 | Clustering : Spectral | 1,2,3 | MMDS 10.4; CSTIA 8.4; Luxburg; Gleich | Document Hash (tex) | |
| Mon 2.24 | Frequent Items : Heavy Hitters | 1,2,3 | MMDS 4.1; CSTIA 7.1.3; Min-Count Sketch; Misra-Gries | | Data Collection Report |
| Wed 2.26 | Frequent Itemsets : Apriori Algorithm | 1,2,3 | MMDS 6+4.3; Careful Bloom Filter Analysis | | |
| Mon 3.03 | Regression : Basics in 2-dimensions | 1,2,3 | ESL 3.2 and 3.4 | | |
| Wed 3.05 | Regression : SVD + PCA | 1,2,3 | Geometry of SVD - Chap 3; CSTIA 4 | Clustering (tex) | |
| Mon 3.10 | (Spring Break - No Class) | | | | |
| Wed 3.12 | (Spring Break - No Class) | | | | |
| Mon 3.17 | Regression : Column Sampling and Frequent Directions | 1,2,3 | MMDS 9.4; CSTIA 2.7 + 7.2.2; arXiv | | |
| Wed 3.19 | Regression : Compressed Sensing and OMP | 1,2,3 | CSTIA 10.3; Tropp + Gilbert | | Intermediate Report |
| Mon 3.24 | Regression : L1 Regression and Lasso | 1,2,3 | Davenport; ESL 3.8 | | |
| Wed 3.26 | Noise : Noise in Data | 1,2,3 | MMDS 9.1; Tutorial | Frequent (tex) | |
| Mon 3.31 | Noise : Privacy | 1,2,3 | Dwork | | |
| Wed 4.02 | Graph Analysis : Markov Chains | 1,2,3 | MMDS 10.1 + 5.1; CSTIA 5; Weckesser notes | | |
| Mon 4.07 | Graph Analysis : PageRank | 1,2,3 | MMDS 5.1 + 5.4 | | |
| Wed 4.09 | (room change: hill east of MEB) Graph Analysis : MapReduce | 1,2,3 | MMDS 2; Old Lecture 1, 2, 3; Overview Lecture | Regression (tex) | |
| Mon 4.14 | Graph Analysis : PageRank via MapReduce | 1,2,3 | MMDS 5.2 | | Final Report |
| Wed 4.16 | Graph Analysis : Communities | 1,2,3 | MMDS 10.2 + 5.5; CSTIA 8.8 + 3.4 | | Poster Outline |
| Mon 4.21 | Graph Analysis : Graph Sparsification | 1,2,3 | MMDS 4.1 | | |
| Wed 4.23 | Poster Day !!! | | | | Poster Presentation |
| Mon 4.29 | | | | Graphs (tex) | |

Grading: The grading will be 50% from homeworks and 50% from a project.

We plan to have 5 or 6 short homework assignments, roughly covering each main topic in the class. The homeworks will usually consist of an analytical problem set, and sometimes a light programming exercise. There will be no specific programming language for the class, but some assignments may be designed around a specific one that is convenient for that task.

Each person in the class will be responsible for a small project. I will allow small groups to work together. The project will be very open-ended; basically it will consist of finding an interesting data set, exploring it with one or more techniques from class, and presenting what you found. I will try to provide suggestions for data sources and topics, but ultimately the groups will need to decide on their own topic. There will be several intermediate deadlines so projects are not rushed at the end of the semester.

Late Policy: To get full credit for an assignment, it must be turned in through Canvas by the start of class, specifically 5pm. Assignments turned in after the 5pm deadline lose 10%. For every subsequent 24 hours until the assignment is turned in, another 10% is deducted. That is, a homework worth 10 points turned in 30 hours late will have lost 2 points. Once the graded assignment is returned, or one week has passed, any assignment not yet turned in will be given a 0.
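
As an illustration only, the late-penalty arithmetic can be sketched in a few lines of Python. The function name late_score and its parameters are hypothetical, not part of any class tooling. The sketch assumes that an assignment exactly 24 hours late already incurs the second 10% deduction, and that "one week" means 168 hours after the deadline; the instructor's actual handling of these boundary cases may differ.

```python
import math


def late_score(points, hours_late, returned=False):
    """Points remaining after the late penalty (hypothetical helper).

    points     -- points the submission would earn if on time
    hours_late -- hours past the 5pm deadline (<= 0 means on time)
    returned   -- True if graded assignments were already handed back
    """
    if hours_late <= 0:
        return float(points)  # on time: full credit
    # Assumption: "one week" is read as 7 * 24 hours past the deadline.
    if returned or hours_late >= 7 * 24:
        return 0.0
    # 10% for missing the deadline, plus 10% per completed 24-hour period.
    # Assumption: exactly 24 hours late counts as a completed period.
    penalty_steps = 1 + math.floor(hours_late / 24)
    penalty_percent = min(10 * penalty_steps, 100)
    return points * (100 - penalty_percent) / 100


# The example from the policy: 10 points, 30 hours late, loses 2 points.
print(late_score(10, 30))  # 8.0
```

Here 30 hours late counts as one missed deadline plus one completed 24-hour period, so 20% of the 10 points is deducted.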

Cheating Policy: The Utah School of Computing has a Cheating Policy which requires all registered students to sign an Acknowledgement Form. This form must be signed and turned in to the department office before any homeworks are graded.

This class has the following collaboration policy. For assignments, students may discuss answers with anyone, including problem approach, proofs, and code. But all students must write their own code, proofs, and write-ups. For projects, you may of course work however you like within your groups. You may discuss your project with anyone as well, but if this contributes to your final product, they must be acknowledged (this does not count towards page limits). Of course any outside materials used must be referenced appropriately.

LaTeX: I highly recommend using LaTeX for writing up homeworks. It is something that everyone should know for research and writing scientific documents. This linked directory contains a sample .tex file, as well as what its .pdf compiled outcome looks like. It also has a figure .pdf to show how to include figures.

Discussion Group: