I'm attempting to implement locality-sensitive hashing in PySpark (based on the spark-hash project, written in Scala). The hashing step is generating some strange behavior.
In the step where I take the hash of the list of minhashes generated for each vector, the output appears to depend heavily on whether I do the hashing in parallel inside Spark or in sequence on the driver after collect(). For instance, if I generate the hashes this way (the call to groupByKey() should give me the elements that hash to the same band):
bands = model.signatures.groupByKey().collect()
hashes = [hash(band[1]) for band in bands]
I get a list that resembles what you would expect; namely, lots of unique numbers:
278023609,
278023657,
278023621,
278023449,
278023593,
278023589,
278023529,
278023637,
278023673,
278023429,
278023441,
...
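(For context, each band[1] being hashed above is whatever groupByKey() hands back for a key; checking one in the REPL should show it's a ResultIterable. The check below is only for illustration and isn't part of the pipeline:)
type(bands[0][1])
# expected: <class 'pyspark.resultiterable.ResultIterable'>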
However, when I take that exact same data and hash it using Spark's own constructs instead:
hashes = model.signatures.groupByKey().map(lambda x: hash(x[1])).collect()
Now I get a list that looks like this:
286120785,
286120785,
286120785,
286120785,
286120785,
286120785,
286120785,
286120785,
...
The same hash, repeated over and over. If, however, I use the same Spark constructs but cast the ResultIterable to a frozenset at the last moment:
hashes = model.signatures.groupByKey().map(lambda x: hash(frozenset(x[1].data))).collect()
Now I get a list of unique hashes again. Any idea what's going on? Is there something strange about how hashing works on ResultIterable objects during Spark execution?
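In case it helps, here is a stripped-down, self-contained sketch of the same three steps on a toy RDD; the data and names are placeholders I made up for illustration, not the actual spark-hash pipeline:
from pyspark import SparkContext

sc = SparkContext("local[2]", "resultiterable-hash-check")

# Toy (band, minhash) pairs standing in for model.signatures
pairs = sc.parallelize([(i % 10, i) for i in range(1000)])
grouped = pairs.groupByKey()

# 1. Hash each grouped value on the driver, after collect()
driver_hashes = [hash(v) for _, v in grouped.collect()]

# 2. Hash the ResultIterable inside map(), i.e. on the executors
executor_hashes = grouped.map(lambda kv: hash(kv[1])).collect()

# 3. Hash a frozenset of the underlying data inside map()
frozenset_hashes = grouped.map(lambda kv: hash(frozenset(kv[1].data))).collect()

# Count the distinct hashes from each approach to make the difference obvious
print(len(set(driver_hashes)), len(set(executor_hashes)), len(set(frozenset_hashes)))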