Models of Computation for Massive Data
Instructor: Jeff Phillips | Office hours: Mondays, Wednesdays 12:05 pm - 1:05 pm @ MEB 3147 or MEB 3442
TA: Bigyan Mukherjee | Office hours: Mondays, Wednesdays 3:00 - 4:00 pm @ MEB 3115
Fall 2011 | Mondays, Wednesdays 10:45 am - 12:05 pm @ MEB 3147
Catalog number: CS 7960 01
Description:
This course will explore advanced models of computation pertinent to processing massive data sets.
As data sets grow to terabyte and petabyte scales, traditional models and paradigms of sequential computation become obsolete.
New efficiency trade-offs arise as memory usage, I/O calls, or inter-node communication become the dominant bottlenecks.
These paradigms are formalized as I/O-Efficient, Parallel, Streaming, GPU-based, Map-Reduce, and other distributed algorithmic models of computation.
This course will study the history and specifics of these models.
Students in the class will learn the settings in which each paradigm is appropriate, the advantages and disadvantages of each model, and how to analyze algorithms within these settings.
They will be evaluated on both analysis problem sets and basic programming assignments within these models.
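To give a concrete feel for one of these settings, here is a minimal sketch (illustrative only, not course material; the function name and parameters are hypothetical) of reservoir sampling, a classic streaming-model algorithm that maintains a uniform random sample of k items from a stream using memory independent of the stream's length:

    import random

    def reservoir_sample(stream, k):
        """Keep a uniform random sample of k items from a stream using O(k) memory."""
        reservoir = []
        for i, item in enumerate(stream):
            if i < k:
                # Fill the reservoir with the first k items.
                reservoir.append(item)
            else:
                # Replace a random slot with probability k / (i + 1);
                # this keeps every item equally likely to be in the sample.
                j = random.randint(0, i)
                if j < k:
                    reservoir[j] = item
        return reservoir

    # Example: sample 5 items from a stream far too large to store in memory.
    print(reservoir_sample(range(10**6), 5))

This one-pass, low-memory style of computation is exactly the kind of constraint the streaming model formalizes.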
Schedule: (subject to change)
Grading: Grades will be based on the homework assignments (50%) and a project (50%).
Project: A major component of this class will be a project where you will investigate in-depth a focused topic in a particular model or the relation between several models. Details.
FAQ:
Q: Will there be a book?
A: No. Most information is available online and I will post links on the (under-construction) webpage.
As far as I know, this class has not been taught before; different aspects have, but not as a whole. For instance, there is a new book on MapReduce that will definitely help guide that section of the course.
Q: How hands-on will the class be?
A: My current aim for the lecture material is roughly 50% analysis and 50% systems background. But the work for the course can be tailored toward each student's preference. This is all subject to change, but the plan is that half the work will be a series of short assignments, themselves half analysis and half small implementation projects, with about one for each model we cover. I also plan a project due at the end of the term, constituting about half the course's workload; this could be more hands-on, or purely analysis-based. So each student's workload could range from 25% analysis / 75% hands-on to 75% analysis / 25% hands-on.
So the plan is that everyone will get hands-on experience working with some or most of the models, and those who choose to go that direction will get their hands much more dirty :).