I need to compute the Jaccard similarity of all pairs of lines of text, and I want to output only the pairs with a similarity higher than 80%. In the past I have studied the Hadoop MapReduce framework, and this is how I would solve the problem with map and reduce functions:
map(lineID, text):
    for each word in text:
        emit(word, (len(text), lineID))

reduce(word, values):
    if len(values) < 2:
        do nothing
    else:
        for each pair ((len1, lineID1), (len2, lineID2)) in values:
            emit((lineID1, lineID2, len1, len2), 1)

map(k, v):
    emit(k, v)

reduce(k, values):
    similarity = len(values) / (k[2] + k[3] - len(values))
    if similarity > 0.80:
        emit((k[0], k[1]), similarity)
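To make the formula explicit: for two lines with word sets A and B (assuming no word is repeated within a line), the second reduce computes |A ∩ B| / (|A| + |B| - |A ∩ B|), which is exactly the Jaccard similarity. A tiny sanity check in plain Python, with two made-up lines just for illustration:

a = set("the quick brown fox".split())        # 4 distinct words
b = set("the quick brown dog jumps".split())  # 5 distinct words

common = len(a & b)                                      # 3 shared words
similarity = float(common) / (len(a) + len(b) - common)  # 3 / (4 + 5 - 3) = 0.5
print(similarity)  # 0.5 < 0.80, so this pair would not be emitted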
Now I need to implement this pseudocode in PySpark, but I am a little bit stuck. All I have managed to do so far is the first map:
def mapping(line):
    tokens = line.split()
    length = len(tokens) - 1  # number of words, excluding the leading lineID
    lineID = tokens[0]
    return (length, lineID)

c = textFile.map(lambda line: [(word, mapping(line)) for word in line.split()[1:]])
I am not considering the first word because it is the lineID. This is another doubt I have: how do I get the indexes of the lines of the input text? And how are the tasks assigned to the workers? I am still quite confused about how Apache Spark works.
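To show what I am aiming at, here is a rough sketch of how I imagine the first map could look, using flatMap so that each (word, (length, lineID)) record becomes its own element; the file path and the names are just placeholders, and I am not sure this is the idiomatic way:

from pyspark import SparkContext

sc = SparkContext(appName="jaccard-pairs")  # in the pyspark shell, sc already exists
textFile = sc.textFile("lines.txt")         # hypothetical input: "lineID word1 word2 ..."

def to_word_records(line):
    # Emit (word, (line_length, lineID)) for every word after the leading lineID.
    tokens = line.split()
    lineID = tokens[0]
    words = tokens[1:]
    return [(word, (len(words), lineID)) for word in words]

word_records = textFile.flatMap(to_word_records)
# If the file did not carry an explicit lineID, maybe textFile.zipWithIndex()
# could be used to attach an index to every line, but I am not sure.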
Do you have any suggestion about which methods I could use, and in which order, to achieve the same result I get with MapReduce?
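For what it is worth, building on the word_records RDD from the sketch above, this is roughly the chain of transformations I imagine corresponds to my two MapReduce rounds (groupByKey to build the inverted index, candidate pairs per word, reduceByKey to count shared words, then a filter); I do not know whether these are the right methods or the right order:

from itertools import combinations

def candidate_pairs(index_entry):
    # One inverted-index entry: (word, [(len1, lineID1), (len2, lineID2), ...]).
    # Emit ((lineID1, lineID2, len1, len2), 1) for every pair of lines sharing this word;
    # sorting by lineID keeps the key identical no matter which word produced the pair.
    word, entries = index_entry
    for (len1, id1), (len2, id2) in combinations(sorted(entries, key=lambda e: e[1]), 2):
        yield ((id1, id2, len1, len2), 1)

shared_counts = (word_records
                 .groupByKey()                      # word -> [(length, lineID), ...]
                 .flatMap(candidate_pairs)          # one record per shared word per pair
                 .reduceByKey(lambda a, b: a + b))  # number of words shared by each pair

def jaccard(record):
    (id1, id2, len1, len2), common = record
    return ((id1, id2), float(common) / (len1 + len2 - common))

result = shared_counts.map(jaccard).filter(lambda kv: kv[1] > 0.80)

If I understand collect() correctly, result.collect() should then give the ((lineID1, lineID2), similarity) pairs above the threshold, but I would like to know whether this is the right way to do it.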