I have a list of tuples, each containing a hash and the path to a file. I would like to find all duplicates as well as similar items based on Hamming distance. I have a function that takes two hash values and returns a Hamming distance score.
I am stuck on how to loop through the list and find the matching items.
files = [('94ff39ad', '/path/to/file.jpg'), ('94ff39ad', '/path/to/file2.jpg'), ('94ff40ad', '/path/to/file3.jpg'), ('cab91acf', '/path/to/file4.jpg')]
score = hamming_dist(h1, h2)
# score_for_similar > 0.4
I need a dictionary with the path of an original as key and a list of the paths of its duplicates or similar files as value, like:
result = {'/path/to/file.jpg': ['/path/to/file2.jpg', '/path/to/file3.jpg'], '/path/to/file4.jpg': []}
The second key-value pair, {'/path/to/file4.jpg': []}, is not necessary but would be helpful to have.
Currently I loop through the list twice (a nested loop) and compare the values with each other, but I get duplicated results.
I would be very grateful for your help.
P.S. To calculate the Hamming distance score I use:
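import scipy.spatial.distance

def hamming_dist(h1, h2):
    # Split each hash string into a list of characters
    h1 = list(h1)
    h2 = list(h2)
    # Fraction of positions at which the two hashes differ (0.0 = identical)
    score = scipy.spatial.distance.hamming(h1, h2)
    return score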
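
For reference, here is a rough sketch of the grouping I am after, written so that each file is reported only once. Several things in it are assumptions rather than part of my real code: the names `hamming_fraction`, `group_similar` and `max_distance` are made up for illustration, `hamming_fraction` is a pure-Python stand-in for the SciPy call so the sketch runs on its own, and I assume that a lower score means more similar, with 0.4 as the cut-off. If the score is actually a similarity, the comparison would have to be flipped.

```python
def hamming_fraction(h1, h2):
    """Stand-in for the SciPy-based function: fraction of differing positions."""
    if len(h1) != len(h2):
        raise ValueError("hashes must have the same length")
    return sum(a != b for a, b in zip(h1, h2)) / len(h1)


def group_similar(items, max_distance=0.4, distance=hamming_fraction):
    """Map each 'original' path to the paths of its duplicates and near matches.

    Assumption: a distance below max_distance counts as similar.
    """
    result = {}
    assigned = set()  # indices already listed under an earlier original
    for i, (hash_i, path_i) in enumerate(items):
        if i in assigned:
            continue  # already reported as a match, so not an original
        matches = []
        # Only look at later items, so each pair is compared once
        for j in range(i + 1, len(items)):
            if j in assigned:
                continue
            hash_j, path_j = items[j]
            if distance(hash_i, hash_j) < max_distance:
                matches.append(path_j)
                assigned.add(j)
        result[path_i] = matches
    return result


files = [
    ('94ff39ad', '/path/to/file.jpg'),
    ('94ff39ad', '/path/to/file2.jpg'),
    ('94ff40ad', '/path/to/file3.jpg'),
    ('cab91acf', '/path/to/file4.jpg'),
]
result = group_similar(files)
```

The inner loop starts at `i + 1` and skips anything already assigned, which is what avoids the doubled results. The grouping is greedy and depends on the order of the list: the first file of a group becomes the key, and a file is never listed under more than one original.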