I need to compare a large number of strings similar to 50358c591cef4d76. I have a Hamming distance function (using pHash) I can use. How do I do this efficiently? My pseudocode would be:
For each string
currentstring= string
For each string other than currentstring
Calculate Hamming distance
I'd like to output the results as a matrix and be able to retrieve values. I'd also like to run it via Hadoop Streaming!
Any pointers are gratefully received.
Here is what i have tried but it is slow:
import glob
path = lotsdir + '*.*'
files = glob.glob(path)
files.sort()
setOfFiles = set(files)
print len(setOfFiles)
i=0
j=0
for fname in files:
print 'fname',fname, 'setOfFiles', len(setOfFiles)
oneLessSetOfFiles=setOfFiles
oneLessSetOfFiles.remove(fname)
i+=1
for compareFile in oneLessSetOfFiles:
j+=1
hash1 = pHash.imagehash( fname )
hash2 = pHash.imagehash( compareFile)
print ...