
I need to compare a large number of strings similar to 50358c591cef4d76. I have a Hamming distance function (using pHash) I can use. How do I do this efficiently? My pseudocode would be:

For each string
    currentstring = string
    For each string other than currentstring
        Calculate Hamming distance

I'd like to output the results as a matrix and be able to retrieve values. I'd also like to run it via Hadoop Streaming!

Any pointers are gratefully received.

Here is what I have tried, but it is slow:

import glob
import pHash  # pHash bindings providing imagehash()

path = lotsdir + '*.*'  # lotsdir is the directory holding the files
files = glob.glob(path)
files.sort()
setOfFiles = set(files)
print len(setOfFiles)

i = 0
j = 0
for fname in files:
    print 'fname', fname, 'setOfFiles', len(setOfFiles)
    # note: this aliases the set rather than copying it, so each file is
    # removed for good and every unordered pair is compared only once
    oneLessSetOfFiles = setOfFiles
    oneLessSetOfFiles.remove(fname)
    i += 1

    for compareFile in oneLessSetOfFiles:
        j += 1
        # the hashes are recomputed on every pass through the inner loop
        hash1 = pHash.imagehash(fname)
        hash2 = pHash.imagehash(compareFile)
        print ...
schoon
  • If you want to compare each string with each string you will have two nested loops. Is that what you want to do? –  Jul 04 '14 at 10:31
  • You can use the [ceja](https://github.com/mrpowers/ceja) library to leverage the power of PySpark and apply the hamming distance algorithm on billions of rows of data. – Powers Jul 08 '20 at 21:48

1 Answer

The distance package in Python provides a Hamming distance calculator:

import distance

distance.levenshtein("lenvestein", "levenshtein")  # 3
distance.hamming("hamming", "hamning")  # 1

There is also a Levenshtein package which provides Levenshtein distance calculations. Finally, difflib in the standard library can provide some simple string comparisons.
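If you only need a quick similarity score without installing anything, a minimal difflib sketch (the two strings here are just placeholders) looks like this:

import difflib

# SequenceMatcher yields a similarity ratio between 0.0 and 1.0
ratio = difflib.SequenceMatcher(None, "hamming", "hamning").ratio()
print(ratio)  # approximately 0.857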

There is more information and example code for all of these on this old question.

Your existing code is slow because you recalculate the file hashes in the innermost loop, which means every file gets hashed many times. If you calculate the hashes up front, the process will be much more efficient:

files = ...
# hash each file exactly once, keeping (filename, hash) pairs
files_and_hashes = [(f, pHash.imagehash(f)) for f in files]
file_comparisons = [
    (hamming(first[1], second[1]), first[0], second[0])
    for second in files_and_hashes
    for first in files_and_hashes
    if first[0] != second[0]
]
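To get the matrix-style lookup you asked for, one option is to store the distances in a dict keyed by filename pairs. This is only a sketch, reusing the files_and_hashes list and your hamming function from above:

# Build a lookup table: distances[(file_a, file_b)] -> Hamming distance.
# Walking the upper triangle computes each unordered pair only once.
distances = {}
for idx, (name1, hash1) in enumerate(files_and_hashes):
    for name2, hash2 in files_and_hashes[idx + 1:]:
        d = hamming(hash1, hash2)
        distances[(name1, name2)] = d
        distances[(name2, name1)] = d  # distance is symmetric

# Retrieval is then a plain dict lookup:
# distances[('image1.png', 'image2.png')]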

This process fundamentally involves O(N^2) comparisons, so distributing it as a map-reduce job means taking the complete set of strings and dividing them into B blocks, where B^2 = M (B = number of string blocks, M = number of workers). So if you had 16 strings and 4 workers, you would split the list of strings into two blocks of 8 strings each. An example of dividing the work follows:

all_strings = [...]
first_8 = all_strings[:8]
last_8 = all_strings[8:]
# Each worker handles one block pair; the four pairs together cover
# every combination of strings (compare_all and the machine handles
# stand in for however you dispatch the work)
compare_all(machine_1, first_8, first_8)
compare_all(machine_2, first_8, last_8)
compare_all(machine_3, last_8, first_8)
compare_all(machine_4, last_8, last_8)
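Since you mention Hadoop Streaming: a minimal mapper sketch follows. It assumes (this input format is my invention, not anything standard) that a preprocessing step has already hashed every file once and written out lines of the form file_a<TAB>hash_a<TAB>file_b<TAB>hash_b, with the hashes as 64-bit hex values and one block pair per input file:

#!/usr/bin/env python
# mapper.py - Hadoop Streaming mapper sketch (hypothetical input format).
# Each input line: <file_a>\t<hash_a>\t<file_b>\t<hash_b>
import sys

def hamming(h1, h2):
    # Hamming distance between two 64-bit integer hashes:
    # count the set bits in their XOR
    return bin(h1 ^ h2).count('1')

for line in sys.stdin:
    file_a, hash_a, file_b, hash_b = line.strip().split('\t')
    d = hamming(int(hash_a, 16), int(hash_b, 16))
    # emit "file_a,file_b<TAB>distance" for the reducer or final output
    print('%s,%s\t%d' % (file_a, file_b, d))

You would then launch it with the usual streaming options (-input, -output, -mapper), letting Hadoop split the block pairs across the workers.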
Matthew Franglen
  • Thanks for your help but I already have a Hamming distance calculator. I have moved my hashing to outside the loop as I was doing it far too many times. – schoon Jul 04 '14 at 12:43
  • I have updated my answer. You are right that hashing within the loop is too slow. – Matthew Franglen Jul 04 '14 at 14:27
  • Link to http://stackoverflow.com/questions/682367/good-python-modules-for-fuzzy-string-comparison is broken – codebox Oct 20 '15 at 06:22
  • No doubt the [fun police](http://blog.stackoverflow.com/2010/01/stack-overflow-where-we-hate-fun/) have had their say. A pity. The only link that was referenced that lacks examples is the levenshtein one. You can read examples for that here: http://www.coli.uni-saarland.de/courses/LT1/2011/slides/Python-Levenshtein.html – Matthew Franglen Oct 20 '15 at 19:05