2

I have a list of tuples like with a hash and path to a file. I would like to find all duplicates as well as similar items based on hamming-distance. I have a function for haming distance score where I give to values and get the score.

I stuck with the problem to loop through the list and find matching items.

list = [('94ff39ad', '/path/to/file.jpg'), ('94ff39ad', '/path/to/file2.jpg'), ('94ff40ad', '/path/to/file3.jpg'), ('cab91acf', '/path/to/file4.jpg')]
score = haming_score(h1, h2)
# score_for_similar > 0.4

I need a dictionary with an original (path) as key and a list of possible similar or duplicates as value.

like:

result = {'/path/to/file.jpg': ['/path/to/file2.jpg', '/path/to/file3.jpg'], '/path/to/file4.jpg': []}

The second dict key value pair {'/path/to/'file4.jpg': []} is not necessary but helpful to have.

Currently I loop twice through the list and compare the values with each other. But I get double results.

I would be very greateful for your help.

P.S. to calculate the hamming-distance score I use:

def hamming_dist(h1, h2):
    h1 = list(h1)
    h2 = list(h2)
    score = scipy.spatial.distance.hamming(h1, h2)
    return score
WillHoh
  • 67
  • 11
  • How do you determinate which should be the original key? `/path/to/file.jpg` what's the criteria for that being the first key and not `/path/to/file2.jpg`? First occurrence? – Torxed Mar 16 '18 at 11:44
  • hi, it's enough if it's the first one. I just remembered that it is enough to get lists of "similar" ones. But they should only not appear in 2 lists. – WillHoh Mar 16 '18 at 11:46

2 Answers2

1
import Levenshtein as leven

# This is your list
xlist = [('94ff39ad', '/path/to/file.jpg'), ('512asasd', '/somepath/to/file.jpg'), ('94ff39ad', '/path/to/file2.jpg'), ('94ff40ad', '/path/to/file3.jpg'), ('cab91acf', '/path/to/file4.jpg')]

# Here's what you'll base the difference upon:
simalarity_threshhold = 80 # 80%
results = {}

for item in xlist:
    path = item[1]
    print('Examining path:', path)

    max_similarity = None
    similarity_key = None
    for key in results:
        diff = leven.distance(path, key)
        diff_percentage = 100 / max(len(path), len(key)) * (min(len(path), len(key)-diff))

        print('    {} vs {} = {}%'.format(path, key, int(diff_percentage)))
        if diff_percentage > simalarity_threshhold:
            if not max_similarity or diff_percentage > max_similarity:
                max_similarity = diff_percentage
                similarity_key = key

    if not max_similarity:
        results[path] = {}
    else:
        results[similarity_key][path] = max_similarity

print(results)

If two paths have a similarity above 80% of the distance value of each other, they will be paired as a potential match. If another path is more similar, it will be added to that instead.

And if the path is below 80% similarity, it will create it's own result path/tree/structure.

Again, this is just one example of how you'd solve it.
There's many ways to do it, but I prefer Levenshtein because it's easy to work with and pretty accurate.

I left some debug code in there so you could see the way of thinking, what values are being passed where etc. again, this is all up to your rule-set for determining what's a match and not.

Oh, and I also stored the values as sub-dictionaries. This so that each potential candidate kept it's score it got upon checking. You could save them as a list as well. But lists are extremely slow in comparison to dictionaries, both in iteration, comparison and storing.

Second "oh".. This code is not regression tested.. There is bound to be some logical issues here.. Especially in the diff_percentage calculation.. I'm by noooo means a math wizard. But you catch my drift :)

Torxed
  • 22,866
  • 14
  • 82
  • 131
  • Hi, I will work with your way of solving this problem. I just wanted to note that I compare the hash values (created with the dHash algorithm). – WillHoh Mar 16 '18 at 13:46
0

to log how I solved the problem and to be a help for others, here is my code:

test = [('94ff39ad', '/path/to/file.jpg'), ('94ff39ad', '/path/to/file2.jpg'), ('94ff40ad', '/path/to/file3.jpg'), ('cab91acf', '/path/to/file4.jpg')]
seen = {}
new_seen = False
counter = 0
for x in test:
    added = False
    for k, v in seen.items():
        if x[0] == k or hamming_dist(x[0], k) < .4:
            v.append(x[1])
            added = True

    if not seen or not added:
        seen[x[0]] = [x[1]]

print(seen)

>> {'/path/to/file.jpg': ['/path/to/file2.jpg', '/path/to/file3.jpg'], '/path/to/file4.jpg': []}
WillHoh
  • 67
  • 11