I need to compute the Jaccard similarity of all pairs of lines of text, and I want to output only the pairs with a similarity higher than 80%. In the past I have studied the Hadoop MapReduce framework, and this is how I would solve the problem with map and reduce functions:
map(lineID, text):
    for each word in text:
        emit(word, (len(text), lineID))

reduce(word, values):
    if len(values) < 2:
        do nothing
    else:
        for each pair ((len1, lineID1), (len2, lineID2)) in values:
            emit((lineID1, lineID2, len1, len2), 1)

map(k, v):
    emit(k, v)

reduce(k, values):
    similarity = len(values) / (k[2] + k[3] - len(values))
    if similarity > 0.80:
        emit((k[0], k[1]), similarity)
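To make the formula explicit: for two lines with word sets A and B (assuming no word is repeated within a line), the second reduce computes |A ∩ B| / (|A| + |B| - |A ∩ B|), which is exactly the Jaccard similarity. A tiny sanity check in plain Python, with two made-up lines just for illustration:

a = set("the quick brown fox".split())        # 4 distinct words
b = set("the quick brown dog jumps".split())  # 5 distinct words

common = len(a & b)                                      # 3 shared words
similarity = float(common) / (len(a) + len(b) - common)  # 3 / (4 + 5 - 3) = 0.5
print(similarity)  # 0.5 < 0.80, so this pair would not be emitted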
Now I need to implement this pseudocode in PySpark, but I am a little bit stuck. All I have managed to do so far is the first map:
def mapping(line):
    tokens = line.split()
    length = len(tokens) - 1  # number of words, excluding the leading lineID
    lineID = tokens[0]
    return (length, lineID)

c = textFile.map(lambda line: [(word, mapping(line)) for word in line.split()[1:]])
I am not considering the first word because it is the lineID. This is another doubt I have: how do I get the indexes of the lines of the input text? And how are the tasks assigned to the workers? I am still quite confused about how Apache Spark works.
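To show what I am aiming at, here is a rough sketch of how I imagine the first map could look, using flatMap so that each (word, (length, lineID)) record becomes its own element; the file path and the names are just placeholders, and I am not sure this is the idiomatic way:

from pyspark import SparkContext

sc = SparkContext(appName="jaccard-pairs")  # in the pyspark shell, sc already exists
textFile = sc.textFile("lines.txt")         # hypothetical input: "lineID word1 word2 ..."

def to_word_records(line):
    # Emit (word, (line_length, lineID)) for every word after the leading lineID.
    tokens = line.split()
    lineID = tokens[0]
    words = tokens[1:]
    return [(word, (len(words), lineID)) for word in words]

word_records = textFile.flatMap(to_word_records)
# If the file did not carry an explicit lineID, maybe textFile.zipWithIndex()
# could be used to attach an index to every line, but I am not sure.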
Do you have any suggestion about which methods I could use, and in which order, to achieve the same result I get with MapReduce?
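For what it is worth, building on the word_records RDD from the sketch above, this is roughly the chain of transformations I imagine corresponds to my two MapReduce rounds (groupByKey to build the inverted index, candidate pairs per word, reduceByKey to count shared words, then a filter); I do not know whether these are the right methods or the right order:

from itertools import combinations

def candidate_pairs(index_entry):
    # One inverted-index entry: (word, [(len1, lineID1), (len2, lineID2), ...]).
    # Emit ((lineID1, lineID2, len1, len2), 1) for every pair of lines sharing this word;
    # sorting by lineID keeps the key identical no matter which word produced the pair.
    word, entries = index_entry
    for (len1, id1), (len2, id2) in combinations(sorted(entries, key=lambda e: e[1]), 2):
        yield ((id1, id2, len1, len2), 1)

shared_counts = (word_records
                 .groupByKey()                      # word -> [(length, lineID), ...]
                 .flatMap(candidate_pairs)          # one record per shared word per pair
                 .reduceByKey(lambda a, b: a + b))  # number of words shared by each pair

def jaccard(record):
    (id1, id2, len1, len2), common = record
    return ((id1, id2), float(common) / (len1 + len2 - common))

result = shared_counts.map(jaccard).filter(lambda kv: kv[1] > 0.80)

If I understand collect() correctly, result.collect() should then give the ((lineID1, lineID2), similarity) pairs above the threshold, but I would like to know whether this is the right way to do it.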