Hari Sundar


CS 6230: Parallel Computing and High Performance Computing

## Assignment 2

### Due 11 March 2015

Please submit a single pdf and a single tgz named uid.{pdf,tgz} respectively, where uid is a lowercase u followed by 7 digits.

0. Code sample: I have checked in a hybrid code sample on the course github page. Look at the example under the hybrid folder. You can compile this on Stampede using the Intel compiler as

   ```bash
   mpicxx -o reduce -openmp reduce.cpp
   ```

   If you wish to use gcc, replace -openmp with -fopenmp.

1. Given \(x_0\) and \(\{a_i,b_i\}_{i=1}^{n-1}\), we seek to compute
   \[ x_i = a_i x_{i-1} + b_i, \qquad i = 1, \ldots, n-1. \]
   An obvious sequential algorithm for this problem has \(\mathcal{O}(n)\) complexity. Give a work-depth language algorithm for this problem that has \(\mathcal{O}(\log n)\) depth. In addition, the algorithm should be work-optimal.

2. This problem develops a simple optimal \(\mathcal{O}(\log n)\)-time EREW PRAM algorithm for merging two sorted sequences \(A\) and \(B\), each of length \(n\). We start by noting that the bitonic sorting network provides an EREW PRAM merging algorithm that runs in \(\mathcal{O}(\log n)\) time, but uses \(\mathcal{O}(n\log n)\) operations. To make this algorithm optimal, we start by partitioning \(A\) and \(B\) into \(\frac{n}{\log n}\) blocks of almost equal lengths and follow the general idea of the optimal merging algorithm described in class. Devise a work-optimal bitonic merge algorithm for the EREW PRAM and give the pseudocode.

3. Using the work-depth language, give an algorithm for computing the median of an unsorted array of \(n\) integers. What are the work and depth of your algorithm? Is your algorithm work-optimal? Suggest how you would find the \(k\)-ranked element in an unsorted array (hint: the median is the \(k = n/2\)-ranked element).

4. Generalize the bitonic sort algorithm discussed in class. Specifically, describe how you would handle the following two cases:
   * \(p \neq 2^k\),
   * the number of elements on the processes is not the same.

   How should the bitonic merge be modified to handle these cases correctly? Give the overall pseudocode.

5. Programming: Implement an MPI+OpenMP+SIMD parallel sort routine that sorts distributed integers. You will implement a different algorithm for each level of parallelism. This is going to be a tough programming task, so manage your time well. My suggestion is to allocate one week to each level. Although the SIMD level should not take long, you should use the extra time to run experiments and optimize your code. Let us look at each of these levels and the associated algorithms.
   * Samplesort + MPI: You should implement samplesort as your distributed sort. The pseudocode for samplesort is given below.
   * Mergesort + OpenMP: In the second part you will implement a parallel merge sort using OpenMP. At the base level, divide the array into num_threads chunks and sort them sequentially, say using std::sort. Then merge these num_threads sorted arrays using mergesort. A minimal sketch follows this list.
   * Merge + SIMD: During mergesort, we merge two arrays at a time. This operation can be vectorized, and it is going to look like the bitonic merge operation. Refer to this paper for more details. You should implement this using AVX and test it on Stampede.
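To make the OpenMP level concrete, here is a minimal sketch of the structure described in the second bullet: chunking, sequential std::sort per chunk, then pairwise merging of the sorted chunks. The function name omp_mergesort matches the comments in the samplesort pseudocode below; the tree-shaped merge loop, the use of omp_get_max_threads, and std::inplace_merge are just one way to structure it, not the required implementation.

```cpp
#include <algorithm>
#include <omp.h>

// Sketch of the OpenMP level (one possible structure, not the required one).
// Sorts array[0..n): chunks are sorted independently, then merged pairwise
// in a tree until the whole array is sorted.
void omp_mergesort(int* array, int n) {
  int num_threads = omp_get_max_threads();
  // chunk t covers [offset(t), offset(t+1))
  auto offset = [&](int t) { return (long)n * t / num_threads; };

  // 1. sort each chunk sequentially
  #pragma omp parallel for
  for (int t = 0; t < num_threads; ++t)
    std::sort(array + offset(t), array + offset(t + 1));

  // 2. merge sorted chunks pairwise: after each pass, runs twice as long are sorted
  for (int width = 1; width < num_threads; width *= 2) {
    #pragma omp parallel for
    for (int t = 0; t < num_threads; t += 2 * width) {
      int mid = std::min(t + width, num_threads);
      int hi  = std::min(t + 2 * width, num_threads);
      if (mid < hi)
        std::inplace_merge(array + offset(t), array + offset(mid), array + offset(hi));
    }
  }
}
```

The std::inplace_merge call is exactly where the AVX bitonic merge from the third bullet would plug in, since it always merges two sorted runs at a time.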
Samplesort pseudocode (the bucket counts, i.e. how many elements go to each process, are left for you to compute):

```cpp
int samplesort(int* array, int n, MPI_Comm comm) {
  int p, rank;
  MPI_Comm_size(comm, &p);
  MPI_Comm_rank(comm, &rank);
  const int root = 0;

  int *select;       // p-1 local samples
  int *all_samples;  // p*(p-1) gathered samples (at root)
  int *splitters;    // p-1 splitters (on every rank)

  // local sort
  std::sort(array, array + n);            // replace with your omp_mergesort()

  // pick p-1 local samples
  select = sample(p-1, array);

  // gather samples at root
  MPI_Gather(select, p-1, MPI_INT, all_samples, p-1, MPI_INT, root, comm);

  if (rank == root) {
    // sort the gathered samples
    std::sort(all_samples, all_samples + p*(p-1)); // replace with your omp_mergesort()
    // pick p-1 splitters from the sorted samples
    splitters = sample(p-1, all_samples);
  }

  // broadcast splitters
  MPI_Bcast(splitters, p-1, MPI_INT, root, comm);

  // bin data into buckets
  // -- compute how many elements we put in each bucket;
  //    let's call these send_cnts

  // exchange data
  int *send_cnts, *recv_cnts;       // p integers each
  int *send_offsets, *recv_offsets; // p integers each
  int *send_buf, *recv_buf;
  int total_recv;

  // exchange counts
  MPI_Alltoall(send_cnts, 1, MPI_INT, recv_cnts, 1, MPI_INT, comm);

  // compute offsets
  send_offsets[0] = 0;
  recv_offsets[0] = 0;
  for (int i = 1; i < p; ++i) {
    send_offsets[i] = send_offsets[i-1] + send_cnts[i-1];
    recv_offsets[i] = recv_offsets[i-1] + recv_cnts[i-1];
  }
  total_recv = recv_offsets[p-1] + recv_cnts[p-1];

  // allocate the receive buffer
  recv_buf = (int *) malloc(sizeof(int) * total_recv);

  // do the actual all-to-all exchange
  MPI_Alltoallv(send_buf, send_cnts, send_offsets, MPI_INT,
                recv_buf, recv_cnts, recv_offsets, MPI_INT, comm);

  // final local sort of the received data
  std::sort(recv_buf, recv_buf + total_recv); // replace with your omp_mergesort()

  return total_recv; // number of elements now owned locally
}
```
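The sample() helper used above is not defined in the pseudocode. One possible reading, purely as an assumption, is that it picks k evenly spaced elements from a sorted array; that needs the array length, which is added here as an explicit third argument:

```cpp
#include <cstdlib>

// Assumed interpretation of sample(): pick k evenly spaced elements from the
// sorted array in[0..len), skipping the extremes so splitters fall strictly
// inside the value range. The explicit len argument is an addition to the
// pseudocode's sample(k, in).
int* sample(int k, const int* in, int len) {
  int* out = (int*) malloc(sizeof(int) * k);
  for (int i = 0; i < k; ++i)
    out[i] = in[(long)(i + 1) * len / (k + 1)];
  return out;
}
```

With this reading, each rank would call sample(p-1, array, n) on its sorted local data, and the root would call sample(p-1, all_samples, p*(p-1)) to pick the splitters.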