Data Mining
Instructor : Jeff Phillips | Office hours: Thursdays 1-2pm @ MEB 3442 (and often directly after class)
TA: Mingwang Tang | Office hours: Tuesdays 10-11am @ MEB 4158
Spring 2012 | Mondays, Wednesdays 1:25 pm - 2:45 pm
MEB 3105
Catalog number: CS 5955 01 (ugrad) or CS 6955 01 (grad)

Data mining is the study of efficiently finding structures and patterns in data sets. We will also study what structures and patterns you cannot find. The structures and patterns are based on statistical and probabilistic principles, and they are found efficiently through the use of clever algorithms. This class will take this two-pronged approach to the topic; we will understand the models and then explore efficient algorithms to find them.
Topics will include: similarity search, clustering, regression/dimensionality reduction, anomaly detection, link analysis (PageRank), and small space summaries.

Book: We will follow, when relevant, Mining Massive Data Sets (MMDS) by Anand Rajaraman and Jeff Ullman. The digital version of the book is free, but you may wish to purchase a hard copy.
When material is not covered by the book, free reference material will be linked to or produced.

Prerequisites: A student who is comfortable with basic probability, basic big-O analysis, and simple programming will be qualified for the class. For undergrads, the prerequisites are CS 3505 and CS 2100. Although this is planned to become a regular class, this is the first year it is being taught, so undergrads will need instructor permission. Please email the instructor to receive a code.
Schedule: (subject to change)
Date Topic Link Assignment (latex) Project
Mon 1.09 Class Overview MMDS 1.1
Wed 1.11 Clustering: MetaClustering | Guest Lecture by Parasaran Raman
Mon 1.16 (MLK Day - No Class)
Wed 1.18 MapReduce | Guest Lecture by Jeffrey Jestes MMDS 2.2
Mon 1.23 Statistical Principles : Birthday Paradox + Coupon Collector MMDS 1.2
Wed 1.25 Statistical Principles : Uniform Samples and Tail Bounds Pak notes
Mon 1.30 Similarity : Jaccard + Shingling MMDS 3.1 + 3.2
Wed 2.01 Similarity : Min Hashing MMDS 3.3 Statistical Principles
Fri 2.03 Probability Review
Mon 2.06 Similarity : LSH MMDS 3.4 Proposal
Wed 2.08 Similarity : Distances MMDS 3.5
Mon 2.13 Similarity : SIFT and ANN vs. LSH MMDS 3.7 + 7.1.3 Similarities (on Gradiance)
Wed 2.15 Clustering : Hierarchical MMDS 7.2
Mon 2.20 (Presidents Day - No Class)
Wed 2.22 Clustering : K-Means MMDS 7.3 Document Hashing
Mon 2.27 Clustering : Spectral MMDS 10.4 + 11.1 | Luxburg Tutorial Data Collection Report
Wed 2.29 Regression : Basics in 2-dimensions MMDS 11.2 | ESL 3.2 and 3.4
Mon 3.05 Regression : PCA and MDS MMDS 11.3 | Geometry of SVD - Chap 3
Wed 3.07 Regression : Random Projections (JL) and column sampling MMDS 11.4 | Proof | column sampling Clustering
Mon 3.12 (Spring Break - No Class)
Wed 3.14 (Spring Break - No Class)
Mon 3.19 Regression : Compressed Sensing and OMP Tropp + Gilbert
Wed 3.21 Regression : L1 Regression and Lasso Davenport | ESL 3.8 Intermediate Report
Mon 3.26 Anomaly Detection : Outliers + Heavy Tails + Uncertainty MMDS 9.1 + Tutorial
Wed 3.28 Anomaly Detection : Heavy Hitters MMDS 4.1 | Count-Min Sketch | Misra-Gries
Mon 4.02 Anomaly Detection : Privacy Dwork
Wed 4.04 Link Analysis : Markov Chains MMDS 10.1 + 5.1| Weckesser notes Regression
Mon 4.09 Link Analysis : PageRank MMDS 5.1 + 5.4 Final Report
Wed 4.11 Link Analysis : PageRank via MapReduce MMDS 5.2
Mon 4.16 Link Analysis : Communities MMDS 10.2 + 5.5
Wed 4.18 Summaries : Graph Sparsification MMDS 4.1 Poster Outline
Mon 4.23 Summaries : Bloom Filters and Quantiles MMDS 4.3 | Careful Bloom Filter Analysis
Wed 4.25 Poster Day !!! Poster Presentation
Mon 4.30 Markov Chains

Grading: The grading will be 50% from homeworks and 50% from a project.

We plan to have 5 or 6 short homework assignments, roughly one for each main topic in the class. The homeworks will usually consist of an analytical problem set, and sometimes a light programming exercise. There will be no required programming language for the class, but some assignments may be designed around a specific one that is convenient for that task.

Each person in the class will be responsible for a small project. I will allow small groups to work together. The project will be very open-ended; basically it will consist of finding an interesting data set, exploring it with one or more techniques from class, and presenting what you found. I will try to provide suggestions for data sources and topics, but ultimately the groups will need to decide on their own topic. There will be several intermediate deadlines so projects are not rushed at the end of the semester.

Cheating Policy: The Utah School of Computing has a Cheating Policy which requires all registered students to sign an Acknowledgement Form. This form must be signed and turned in to the department office before any homeworks are graded.

This class has the following collaboration policy. For assignments, students may discuss answers with anyone, including problem approach, proofs, and code. But all students must write their own code, proofs, and write-ups. For projects, you may of course work however you like within your groups. You may discuss your project with anyone as well, but if someone contributes to your final product, they must be acknowledged (this does not count towards page limits). Of course any outside materials used must be referenced appropriately.

LaTeX: I highly recommend using LaTeX for writing up homeworks. It is something that everyone should know for research and writing scientific documents. This linked directory contains a sample .tex file, as well as what its compiled .pdf output looks like. It also has a figure .pdf to show how to include figures.