
I have a large list of DOIs and I need the most efficient way to identify the DOIs that are repeated (i.e., print out the index and the DOI for values that are repeated). The list could contain 500,000+ DOIs. My current approach is this (inspired by this answer):

from collections import defaultdict

D = defaultdict(list)
for i, item in enumerate(doiList):
    D[item].append(i)  # map each DOI to the list of indexes where it appears
D = {k: v for k, v in D.items() if len(v) > 1}  # keep only DOIs seen more than once
print(D)

Is there a more processing-efficient way of doing this?

Sample DOI List:

doiList = ['10.1016/j.ijnurstu.2017.05.011 [doi]', '10.1016/j.ijnurstu.2017.05.011 [doi]',
           '10.1167/iovs.16-20421 [doi]', '10.1093/cid/cix478 [doi]', '10.1038/bjc.2017.133 [doi]',
           '10.3892/or.2017.5646 [doi]', '10.1177/0961203317711009 [doi]', '10.2217/bmm-2017-0087 [doi]',
           '10.1007/s12016-017-8611-x [doi]', '10.1007/s10753-017-0594-5 [doi]', '10.1186/s13601-017-0150-2 [doi]',
           '10.3389/fimmu.2017.00515 [doi]', '10.2147/JAA.S131506 [doi]', '10.2147/JAA.S128431 [doi]',
           '10.1038/s41598-017-02293-z [doi]', '10.18632/oncotarget.17729 [doi]', '10.1073/pnas.1703683114 [doi]',
           '10.1096/fj.201600857RRR [doi]', '10.1128/AAC.00020-17 [doi]', '10.1016/j.jpain.2017.04.011 [doi]',
           '10.1016/j.jaip.2017.04.029 [doi]', '10.1016/j.anai.2017.04.021 [doi]', '10.1016/j.alit.2017.05.001 [doi]']
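
For reference, the snippet above applied to this sample list prints the single repeated DOI with both of its indexes:

{'10.1016/j.ijnurstu.2017.05.011 [doi]': [0, 1]}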
scutnex

2 Answers


Try storing the DOIs you have already seen in a set instead. You can append the indexes of duplicates to a single list, which should speed things up:

seen = set()  # all distinct DOIs encountered so far
dupes = []    # indexes of the 2nd, 3rd, ... occurrences

for i, doi in enumerate(doiList):
    if doi not in seen:
        seen.add(doi)
    else:
        dupes.append(i)

At this point, seen contains all the distinct DOI values, while dupes contains the indexes of the 2nd, 3rd, etc. occurrences of duplicated values. You can look them up in doiList to determine which index corresponds to which value.
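
For example, to print each duplicate's index alongside its DOI (the output the question asks for), something like this would work:

for i in dupes:
    print(i, doiList[i])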

To get some more performance out of this, you can cache the bound methods to avoid a repeated attribute lookup on every iteration:

seen = set()
seen_add = seen.add        # cache the bound method
dupes = []
dupes_append = dupes.append

for i, doi in enumerate(doiList):
    if doi not in seen:
        seen_add(doi)
    else:
        dupes_append(i)
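
If you want to measure the gain on your own data, a rough comparison with timeit might look like this (find_dupes is just an illustrative wrapper around the loop above):

import timeit

def find_dupes(doiList):
    # same loop as above, wrapped in a function for timing
    seen = set()
    seen_add = seen.add
    dupes = []
    dupes_append = dupes.append
    for i, doi in enumerate(doiList):
        if doi not in seen:
            seen_add(doi)
        else:
            dupes_append(i)
    return dupes

print(timeit.timeit(lambda: find_dupes(doiList), number=1000))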
aghast

Here's a full solution that returns the same data set as your example, just more than twice as fast (at the expense of some memory):

def identify_duplicates(data):
    lookup = {}  # maps each value to [first_index, duplicate_count]
    result = {}  # final result, holding only the duplicated values
    for i, v in enumerate(data):
        if v in lookup:  # already in the lookup table, so it's a duplicate
            if v not in result:  # add it to the result set on its first repeat
                result[v] = lookup[v]
            lookup[v][1] += 1  # increase the duplicate count
        else:
            lookup[v] = [i, 0]  # first occurrence, no duplicates yet
    return result

print(identify_duplicates(doiList))
# prints: {'10.1016/j.ijnurstu.2017.05.011 [doi]': [0, 1]}

The stored index is the first occurrence of the found duplicate, as in your example. If you want to store all the duplicate indexes, you can add lookup[v].append(i) after the lookup[v][1] += 1 line, but then the data might look weird (the structure would be [first_index, number_of_occurrences, second_index, third_index, ...]).

Instead, just flip the stored parameters in the lookup[v] modifications: use lookup[v] = [0, i] instead of lookup[v] = [i, 0] and lookup[v][0] += 1 instead of lookup[v][1] += 1. Then lookup[v].append(i) after it will give you a nice result in the form [number_of_occurrences, first_index, second_index, third_index, ...], as in the sketch below.
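
Put together, that variant would look something like this (identify_duplicates_all is just an illustrative name; the occurrence count here tallies repeats beyond the first, matching the original):

def identify_duplicates_all(data):
    lookup = {}  # maps each value to [number_of_occurrences, first_index, later_indexes...]
    result = {}
    for i, v in enumerate(data):
        if v in lookup:
            if v not in result:
                result[v] = lookup[v]  # result shares the list, so later appends show up
            lookup[v][0] += 1    # bump the occurrence count
            lookup[v].append(i)  # record this duplicate's index
        else:
            lookup[v] = [0, i]   # [number_of_occurrences, first_index]
    return result

print(identify_duplicates_all(doiList))
# prints: {'10.1016/j.ijnurstu.2017.05.011 [doi]': [1, 0, 1]}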

zwer