\documentclass[11pt]{article}
\usepackage{classDM14}
\usepackage{hyperref}
\title{Asmt 2: Document Similarity and Hashing}
\author{Turn in through Canvas by 5pm, then come to class: \\
Wednesday, February 19 \\
20 points}
\date{}
%\newcommand{\JS}{\ensuremath{\textsf{\small JS}}}
%\newcommand{\D}{\texttt{\textbf{d}}}
\begin{document}
\maketitle
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section*{Overview}
In this assignment you will explore the use of $k$-grams, Jaccard distance, min hashing, and LSH in the context of document similarity.
You will use four text documents for this assignment:
\begin{itemize} \denselist
\item \href{http://www.cs.utah.edu/~jeffp/teaching/cs5140/A2/D1.txt}{\texttt{http://www.cs.utah.edu/\~{}jeffp/teaching/cs5140/A2/D1.txt}}
\item \href{http://www.cs.utah.edu/~jeffp/teaching/cs5140/A2/D2.txt}{\texttt{http://www.cs.utah.edu/\~{}jeffp/teaching/cs5140/A2/D2.txt}}
\item \href{http://www.cs.utah.edu/~jeffp/teaching/cs5140/A2/D3.txt}{\texttt{http://www.cs.utah.edu/\~{}jeffp/teaching/cs5140/A2/D3.txt}}
\item \href{http://www.cs.utah.edu/~jeffp/teaching/cs5140/A2/D4.txt}{\texttt{http://www.cs.utah.edu/\~{}jeffp/teaching/cs5140/A2/D4.txt}}
\end{itemize}
\vspace{.1in}
\emph{As usual, it is highly recommended that you use LaTeX for this assignment. If you do not, you may lose points if your assignment is difficult to read or hard to follow. Find a sample form in this directory:
\url{http://www.cs.utah.edu/~jeffp/teaching/latex/}}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Creating $k$-Grams (8 points)}
You will construct several types of $k$-grams for all documents. Each document uses an alphabet of at most 27 characters: the 26 lowercase letters and the space.
{\color{blue} Unfortunately, the documents have a few extra characters (e.g., 2, 1, 0, 8, newline); please ignore them. Sorry.}
\begin{itemize} \denselist
\item[\s{[G1]}] Construct $2$-grams based on characters, for all documents.
\item[\s{[G2]}] Construct $3$-grams based on characters, for all documents.
\item[\s{[G3]}] Construct $3$-grams based on words, for all documents.
\end{itemize}
Remember that you should store each $k$-gram only once; duplicates are ignored.
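
As a minimal sketch of the constructions above (not a reference solution), the following Python builds character and word $k$-grams as sets, so duplicates are dropped automatically. The helper names \texttt{clean}, \texttt{char\_kgrams}, and \texttt{word\_kgrams} are hypothetical, and the cleaning rule (delete every character other than a lowercase letter or a space) is one assumed reading of ``please ignore them''.

```python
import re

def clean(text):
    """Drop every character that is not a lowercase letter or a space (assumed rule)."""
    return re.sub(r"[^a-z ]", "", text)

def char_kgrams(text, k):
    """Return the set of distinct character k-grams of text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def word_kgrams(text, k):
    """Return the set of distinct word k-grams, each stored as a tuple of words."""
    words = text.split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

# Illustrative input only; the real documents D1..D4 are read from file.
sample = clean("the cat sat on the mat 2018\n")
g1 = char_kgrams(sample, 2)   # character 2-grams, as in G1
g2 = char_kgrams(sample, 3)   # character 3-grams, as in G2
g3 = word_kgrams(sample, 3)   # word 3-grams, as in G3
print(len(g1), len(g2), len(g3))
```

Storing word $k$-grams as tuples keeps word boundaries explicit, so two different word sequences cannot collapse into the same joined string.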
\paragraph{A: (4 points)} How many distinct $k$-grams are there for each document with each type of $k$-gram? You should report $4 \times 3 = 12$ different numbers.
\paragraph{B: (4 points)} Compute the Jaccard distance between all pairs of documents for each type of $k$-gram. You should report $3 \times 6 = 18$ different numbers.
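
For reference, here is a small worked example of the Jaccard distance, using the standard definition of the distance as one minus the Jaccard similarity. The sets $A$ and $B$ are illustrative only and are not taken from the documents.
\[
\textsf{JS}(A,B) = \frac{|A \cap B|}{|A \cup B|},
\qquad
d_J(A,B) = 1 - \textsf{JS}(A,B).
\]
With $A = \{\texttt{ab}, \texttt{bc}, \texttt{cd}\}$ and $B = \{\texttt{bc}, \texttt{cd}, \texttt{de}\}$ we have $|A \cap B| = 2$ and $|A \cup B| = 4$, so
\[
\textsf{JS}(A,B) = \frac{2}{4} = 0.5
\qquad \textsf{and} \qquad
d_J(A,B) = 1 - 0.5 = 0.5.
\]
The four documents give $\binom{4}{2} = 6$ pairs, which is where the factor of $6$ above comes from.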
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Min Hashing (6 points)}
We will consider a hash family $\Eu{H}$ such that any hash function $h \in \Eu{H}$ maps $h : \{\s{$k$-grams}\} \to [m]$ for $m$ large enough (I suggest $m \geq 10{,}000$).
\paragraph{A: (5 points)} Using grams \s{G2}, build a min-hash signature for each of the documents \texttt{D1} and \texttt{D2} using $t \in \{10, 50, 100, 250, 500\}$ hash functions. For each value of $t$, report the Hamming similarity between the signatures of \texttt{D1} and \texttt{D2}, which estimates their Jaccard similarity:
\[
\Ham(a,b) = \frac{1}{t} \sum_{i=1}^t \begin{cases} 1 & \textsf{if } a_i = b_i \\ 0 & \textsf{if } a_i \neq b_i. \end{cases}
\]
You should report $5$ numbers.
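
The procedure above can be sketched as follows, under stated assumptions. The hash family here is built by salting SHA-256 with an integer seed and reducing modulo $m = 10007$; this is only one possible choice of $\Eu{H}$, not necessarily the one intended in class, and the names \texttt{make\_hash}, \texttt{minhash\_signature}, and \texttt{hamming\_similarity} are hypothetical. The two sets are toy examples rather than the \s{G2} grams of \texttt{D1} and \texttt{D2}.

```python
import hashlib

def make_hash(seed, m=10007):
    """Return a hash function from strings to {0, ..., m-1}, salted by seed (assumed family)."""
    def h(gram):
        digest = hashlib.sha256(f"{seed}:{gram}".encode("utf-8")).hexdigest()
        return int(digest, 16) % m
    return h

def minhash_signature(grams, t, m=10007):
    """Signature of length t: the minimum hash value of the set under each of t functions."""
    signature = []
    for seed in range(t):
        h = make_hash(seed, m)
        signature.append(min(h(g) for g in grams))
    return signature

def hamming_similarity(a, b):
    """Fraction of coordinates on which two equal-length signatures agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

# Toy sets with exact Jaccard similarity 3/5; the estimates should approach it as t grows.
A = {"ab", "bc", "cd", "de"}
B = {"bc", "cd", "de", "ef"}
for t in (10, 50, 100, 250, 500):
    sa = minhash_signature(A, t)
    sb = minhash_signature(B, t)
    print(t, hamming_similarity(sa, sb))
```

Because the range $[m]$ is finite, two distinct grams can collide under the same $h$, which slightly inflates the estimate; a larger $m$ makes this effect smaller.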
\paragraph{B: (1 point)} What seems to be a good value for $t$? You may run more experiments. Justify your answer in terms of both accuracy and time.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{LSH (6 points)}
Consider computing an LSH using $m = 100$ hash functions. We want to find all pairs of documents that have Jaccard similarity above $\tau = 0.3$.
\paragraph{A: (4 points)}
Use the trick mentioned in class and the notes to estimate the best values for the number of rows $r$ per block and the number of blocks $b$ to provide the S-curve
\[
S(s) = 1- (1-s^r)^b
\]
with good separation at $\tau$. Report these values.
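
One way to explore candidate values is sketched below. It evaluates $S(s)$ for every pair $(r,b)$ with $r \cdot b = 100$ and prints the common rule-of-thumb location of the steepest rise, $(1/b)^{1/r}$. It is an assumption that this rule of thumb matches the trick from class, and restricting to exact divisors of $100$ is a simplification; the function names \texttt{s\_curve} and \texttt{steepest\_point} are hypothetical.

```python
def s_curve(s, r, b):
    """S(s) = 1 - (1 - s^r)^b for b blocks of r rows each."""
    return 1.0 - (1.0 - s ** r) ** b

def steepest_point(r, b):
    """Rule-of-thumb similarity at which the S-curve rises fastest (assumed heuristic)."""
    return (1.0 / b) ** (1.0 / r)

total = 100   # number of hash functions, from the text
tau = 0.3     # similarity threshold, from the text

# Columns: r, b, approximate steepest point, S(tau).
for r in range(1, total + 1):
    if total % r == 0:
        b = total // r
        print(r, b, round(steepest_point(r, b), 3), round(s_curve(tau, r, b), 3))
```

Pairs whose steepest point lies near $\tau$ are natural candidates; the same \texttt{s\_curve} function can then be applied to a measured similarity $s$ for any pair of documents.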
\paragraph{B: (2 points)}
Using your choice of $r$ and $b$ and $S(\cdot)$, what is the probability \st{that you will need to check the exact Jaccard similarity} of each pair of the four documents using \s{G2} {\color{blue} being estimated as} having similarity greater than $\tau$?
Report $6$ numbers.
\emph{(Show your work.)}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Bonus (3 points)}
Describe a scheme like Min-Hashing for the \emph{Andberg Similarity}, defined as $\Andb(A,B) = \frac{|A \cap B|}{|A \cup B| + |A \triangle B|}$. That is, given two sets $A$ and $B$ and a family of hash functions $\Eu{H}$, the scheme should satisfy $\Pr_{h \in \Eu{H}}[h(A) = h(B)] = \Andb(A,B)$. Note that the only randomness is in the choice of hash function $h$ from the set $\Eu{H}$, and $h \in \Eu{H}$ represents the process of choosing a hash function (randomly) from $\Eu{H}$. The point of this question is to design this process and show that it has the required property.
Alternatively, show that no such process exists.
\end{document}