Foundations of Data Analysis
Instructor : Jeff Phillips (email) | Office hours: Thursdays 10am @ Zoom (and directly after class on Zoom)
TAs: Hasan Pourmahmood (email) | Office hours: Monday 11am - 1pm (Zoom).
         Peter Jacobs (email) | Office hours: Tuesday 8:30 - 10:30am (Zoom).
Fall 2020 | Mondays, Wednesdays 1:25 pm - 2:45 pm
Catalog number: CS 3190 01
Google Calendar of all lectures & office hours


This class is an introduction to computational data analysis. It focuses on the mathematical foundations while providing some basic experience in analysis techniques. The goal is to carefully develop and explore several core topics that form the backbone of modern data analysis fields, including Machine Learning, Data Mining, Artificial Intelligence, and Visualization. This includes some background in probability and linear algebra, followed by topics including Bayes' Rule and its connection to inference, linear regression and its polynomial and high-dimensional extensions, principal component analysis and dimensionality reduction, and classification and clustering. We will also focus on modern PAC (probably approximately correct) and cross-validation models for algorithm evaluation.
Some of these topics are often covered very briefly at the end of a probability or linear algebra class, and are then assumed as known in advanced data mining or machine learning classes. This class fills that gap. The planned pace will be closer to CS 3130 or Math 2270 than to the 5000-level advanced data analysis courses.

We will use Python in the class to demonstrate and explore basic concepts. But programming will not be the main focus.
TA Hasan Pourmahmood created a short Python tutorial on loading, manipulating, processing, and plotting data in Python using Colab. Here is the Python notebook so you can follow along.

Book: Mathematical Foundations of Data Analysis (v0.6)
This is a draft of a book I started writing in Fall 2016 for this course -- a hard copy (with a few small updates & improvements) may be available for purchase before the semester ends.
More outside online resources are listed below.

Videos: Lectures will be given live and interactively on Zoom.
We will also live stream on YouTube (15-20 second delay). The videos will be archived in this YouTube playlist.

The official pre-requisites are CS 2100, CS 2420, and Math 2270. These are to ensure a certain very basic mathematical maturity (CS 2100), a basic understanding of how to store and manipulate data with some efficiency (CS 2420), and the basics of linear algebra and high dimensions (Math 2270).
We have as a co-requisite CS 3130 (or Math 3070) to ensure some familiarity with probability.
A few lectures will be devoted to reviewing linear algebra and probability, but at a fast pace and with a focus on the data interpretation of these domains.
This class will soon become a pre-requisite for CS 5350 (Machine Learning) and CS 5140 (Data Mining), as part of a new Data Science pipeline.

Date | Chapter | Video | Topic | Assignment
Mon 8.24 | | YT | Class Overview |
Wed 8.26 | Ch 1 - 1.2 | YT | Probability Review : Sample Space, Random Variables, Independence (colab) | HW 1 out
Mon 8.31 | Ch 1.3 - 1.6 | YT | Probability Review : PDFs, CDFs, Expectation, Variance, Joint and Marginal Distributions | Quiz 0
Wed 9.02 | Ch 1.7 | YT | Bayes' Rule : MLEs and Log-likelihoods |
Mon 9.07 | | | |
Wed 9.09 | Ch 1.8 | YT | Bayes' Rule : Bayesian Reasoning |
Mon 9.14 | Ch 2.1 - 2.2 | YT | Convergence : Central Limit Theorem and Estimation (colab) | Quiz 1
Wed 9.16 | Ch 2.3 | YT | Convergence : PAC Algorithms and Concentration of Measure | HW 1 due
Mon 9.21 | Ch 3.1 - 3.2 | YT | Linear Algebra Review : Vectors, Matrices, Multiplication and Scaling |
Wed 9.23 | Ch 3.3 - 3.5 | YT | Linear Algebra Review : Norms, Linear Independence, Rank and numpy (colab) | HW 2 out
Mon 9.28 | Ch 3.6 - 3.8 | YT | Linear Algebra Review : Inverse, Orthogonality | Quiz 2
Wed 9.30 | Ch 5.1 | YT | Linear Regression : explanatory & dependent variables (colab) |
Mon 10.05 | Ch 5.2 - 5.3 | YT | Linear Regression : multiple regression (colab), polynomial regression (colab) |
Wed 10.07 | Ch 5.4 | YT | Linear Regression : overfitting and cross-validation (colab) | HW 2 due
Mon 10.12 | Ch 5 | YT | Linear Regression : mini review + slack | Quiz 3
Wed 10.14 | Ch 6.1 - 6.2 | YT | Gradient Descent : functions, minimum, maximum, convexity & gradients | HW 3 out
Mon 10.19 | Ch 6.3 | YT | Gradient Descent : algorithmic & convergence (colab) |
Wed 10.21 | Ch 6.4 | YT | Gradient Descent : fitting models to data and stochastic gradient descent |
Mon 10.26 | Ch 7.1 - 7.2 | YT | Dimensionality Reduction : SVD | Quiz 4
Wed 10.28 | Ch 7.2 - 7.3 | YT | Dimensionality Reduction : rank-k approximation and eigenvalues (colab) | HW 3 due
Mon 11.02 | Ch 7.4 | YT | Dimensionality Reduction : power method (colab) | HW 4 out
Wed 11.04 | | | Election Day Break (practice quiz questions) |
Mon 11.09 | Ch 7.5 - 7.6 | YT | Dimensionality Reduction : PCA, centering (colab), and MDS (colab) |
Wed 11.11 | Ch 8.1 | YT | Clustering : Voronoi Diagrams + Assignment-based Clustering (colab) |
Mon 11.16 | Ch 8.3 | YT | Clustering : k-means | Quiz 5
Wed 11.18 | Ch 8.4, 8.7 | YT | Clustering : EM, Mixture of Gaussians, Mean-Shift | HW 4 due
Mon 11.23 | Ch 9.1 | YT | Classification : Linear prediction |
Wed 11.25 | Ch 9.2 | YT | Classification : Perceptron Algorithm | HW 5 out
Mon 11.30 | Ch 9.3 | YT | Classification : Kernels and SVMs | Quiz 6
Wed 12.02 | Ch 9.4 - 9.5 | YT | Classification : Neural Nets |
Fri 12.04 | | | | HW 5 due
Mon 12.07 | | YT | Semester Review (starting @ 2:15pm) |
Fri 12.11 | | | FINAL EXAM overlaps with (1:00pm - 3:00pm) |

Class Organization: The class will be run through this webpage, Canvas, and Zoom/YouTube. The schedule, notes, and links will be maintained here. All homeworks will be turned in through Canvas.

Grading: There will be one final exam worth 20% of the grade. Homeworks will be worth 60% of the grade. There will be 5 homeworks and the lowest one can be dropped. Quizzes will be worth 20% of the grade. They will be timed Canvas quizzes. There will be 6 or 7 quizzes (the first, Quiz 0, is worth fewer points).
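
As a rough illustration of how these weights combine, here is a minimal Python sketch. The function name course_grade and its inputs are hypothetical, and it assumes every score is already a fraction between 0 and 1, that the homeworks kept after dropping the lowest are weighted equally, and that the quiz score is already aggregated into one fraction (the syllabus does not say how Quiz 0's smaller point value is handled). It is not the official grade calculation.

```python
def course_grade(homework_scores, quiz_fraction, final_fraction):
    """Combine scores using the 60% / 20% / 20% weights stated above.

    homework_scores: list of fractions in [0, 1], one per homework
                     (assumption: at least two entries).
    quiz_fraction:   overall quiz score as a fraction in [0, 1]
                     (assumption: already aggregated across quizzes).
    final_fraction:  final exam score as a fraction in [0, 1].
    """
    # Drop the single lowest homework, then average the rest equally
    # (equal weighting is an assumption, not stated in the syllabus).
    kept = sorted(homework_scores)[1:]
    homework_average = sum(kept) / len(kept)
    return 0.60 * homework_average + 0.20 * quiz_fraction + 0.20 * final_fraction


if __name__ == "__main__":
    # One missed homework is dropped, so it does not lower the result.
    print(course_grade([1.0, 1.0, 1.0, 1.0, 0.0], 1.0, 1.0))
```

Under these assumptions, a single missed homework has no effect on the total because it is the one that gets dropped.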

The homeworks will usually consist of an analytical problem set, and sometimes light programming exercises in Python. When Python is used, we will typically work through examples in class first.

Late Policy: To get full credit for an assignment, it must be turned in through Canvas by the start of class, specifically 1:10. Once the 1:10 deadline is missed, those turned in late will lose 10%. Every subsequent 24 hours until it is turned in, another 10% is deducted. That is, a homework 30 hours late worth 10 points will have lost 2 points. Once the graded assignment is returned, or 48 hours have passed, any assignment not yet turned in will be given a 0.
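
The late penalty arithmetic can be sketched in Python as follows. The function name points_after_penalty is hypothetical, and the handling of the exact boundaries is an assumption: a submission exactly 24 hours late is treated as having entered the second 10% deduction, and one at exactly 48 hours receives 0. The sketch also does not model the case where the graded assignment has already been returned.

```python
import math


def points_after_penalty(points, hours_late):
    """Points remaining on an assignment turned in hours_late after the deadline.

    Assumptions (not stated in the policy): boundaries are inclusive on the
    late side, so exactly 24 hours late costs 20% and exactly 48 hours late
    earns 0.
    """
    if hours_late <= 0:
        return float(points)  # on time: full credit
    if hours_late >= 48:
        return 0.0  # past the 48-hour cutoff
    # 10% for missing the deadline, plus 10% per further full 24 hours.
    deductions = 1 + math.floor(hours_late / 24)
    credit_percent = 100 - 10 * deductions
    return points * credit_percent / 100


if __name__ == "__main__":
    # The example from the policy: 30 hours late on a 10-point homework.
    print(points_after_penalty(10, 30))  # 8.0, i.e. 2 points lost
```

This reproduces the example in the policy: a 10-point homework turned in 30 hours late keeps 8 points.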

Academic Conduct Policy: The Utah School of Computing has an academic misconduct policy, which requires all registered students to sign an Acknowledgement Form. This form must be signed and turned in to the department office before any homeworks are graded.

This class has the following collaboration policy:
For assignments, students may discuss answers with anyone, including problem approach, proofs, and code. But all students must write their own code, proofs, and write-ups. If you collaborated with another student on homeworks to the extent that you expect your answers may start to look similar, you must explicitly explain the extent of your collaboration on the homework. Students whose homeworks appear too similar, and who did not explain the collaboration, will get a 0 on that assignment.

More Resources:
I hope the book provides all the information required to understand the material for the class, and a solid footing beyond. However, it is sometimes useful to also explore other sources.
Wikipedia is often a good source on many of these topics. In the past students have also enjoyed 3Blue1Brown.

Here are a few other books that cover some of the material, but at a more advanced level:
Understanding ML | Foundations of Data Science | Introduction to Statistical Learning

Here is a list of resources I believe may be useful, with relevant parts at roughly the right level for this course, but often with disparate notation:
  • Probability: ProbStat course | P1 | P2
  • Bayes Rule/Reasoning: B1 | B2 | B3 | B4
  • Linear Algebra: No-BS Book | LA1 | LA2 | LA3
  • Linear Regression: LR1 | LR2
  • Gradient Descent: GD1 | GD2
  • PCA: PCA1 | PCA2 | PCA3 | PCA4
  • Clustering: C1 | C2 | C3 | C4
  • Classification: L1 | L2 | L3