
I have a large list of DOIs and I need the most efficient way to identify the DOIs that are repeated (i.e., print out the index and the DOI for values that are repeated). The list could contain 500,000+ DOIs. My current approach is this (inspired by this answer):

from collections import defaultdict

D = defaultdict(list)
for i, item in enumerate(doiList):
    D[item].append(i)  # map each DOI to the list of indexes where it appears
D = {k: v for k, v in D.items() if len(v) > 1}  # keep only DOIs seen more than once
print(D)

Is there a more processing-efficient way of doing this?

Sample DOI List:

doiList = ['10.1016/j.ijnurstu.2017.05.011 [doi]', '10.1016/j.ijnurstu.2017.05.011 [doi]',
           '10.1167/iovs.16-20421 [doi]', '10.1093/cid/cix478 [doi]', '10.1038/bjc.2017.133 [doi]',
           '10.3892/or.2017.5646 [doi]', '10.1177/0961203317711009 [doi]', '10.2217/bmm-2017-0087 [doi]',
           '10.1007/s12016-017-8611-x [doi]', '10.1007/s10753-017-0594-5 [doi]', '10.1186/s13601-017-0150-2 [doi]',
           '10.3389/fimmu.2017.00515 [doi]', '10.2147/JAA.S131506 [doi]', '10.2147/JAA.S128431 [doi]',
           '10.1038/s41598-017-02293-z [doi]', '10.18632/oncotarget.17729 [doi]', '10.1073/pnas.1703683114 [doi]',
           '10.1096/fj.201600857RRR [doi]', '10.1128/AAC.00020-17 [doi]', '10.1016/j.jpain.2017.04.011 [doi]',
           '10.1016/j.jaip.2017.04.029 [doi]', '10.1016/j.anai.2017.04.021 [doi]', '10.1016/j.alit.2017.05.001 [doi]']
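
For reference, the snippet above applied to this sample list prints the single repeated DOI with both of its indexes:

{'10.1016/j.ijnurstu.2017.05.011 [doi]': [0, 1]}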
scutnex

2 Answers


Try storing the DOIs you have already seen in a set instead. You can append the indexes of duplicates to a single list, which should speed things up:

seen = set()  # all distinct DOIs encountered so far
dupes = []    # indexes of the 2nd, 3rd, ... occurrences

for i, doi in enumerate(doiList):
    if doi not in seen:
        seen.add(doi)
    else:
        dupes.append(i)

At this point, seen contains all the distinct DOI values, while dupes contains the indexes of the 2nd, 3rd, etc. occurrences of duplicated values. You can look them up in doiList to determine which index corresponds to which value.
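
For example, to print each duplicate's index alongside its DOI (the output the question asks for), something like this would work:

for i in dupes:
    print(i, doiList[i])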

To get some more performance out of this, you can cache the bound methods to avoid a repeated attribute lookup on every iteration:

seen = set()
seen_add = seen.add        # cache the bound method
dupes = []
dupes_append = dupes.append

for i, doi in enumerate(doiList):
    if doi not in seen:
        seen_add(doi)
    else:
        dupes_append(i)
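
If you want to measure the gain on your own data, a rough comparison with timeit might look like this (find_dupes is just an illustrative wrapper around the loop above):

import timeit

def find_dupes(doiList):
    # same loop as above, wrapped in a function for timing
    seen = set()
    seen_add = seen.add
    dupes = []
    dupes_append = dupes.append
    for i, doi in enumerate(doiList):
        if doi not in seen:
            seen_add(doi)
        else:
            dupes_append(i)
    return dupes

print(timeit.timeit(lambda: find_dupes(doiList), number=1000))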
aghast

Here's a full solution that returns the same data set as your example, just more than twice as fast (at the expense of some memory):

def identify_duplicates(data):
    lookup = {}  # maps each value to [first_index, duplicate_count]
    result = {}  # final result, holding only the duplicated values
    for i, v in enumerate(data):
        if v in lookup:  # already in the lookup table, so it's a duplicate
            if v not in result:  # add it to the result set on its first repeat
                result[v] = lookup[v]
            lookup[v][1] += 1  # increase the duplicate count
        else:
            lookup[v] = [i, 0]  # first occurrence, no duplicates yet
    return result

print(identify_duplicates(doiList))
# prints: {'10.1016/j.ijnurstu.2017.05.011 [doi]': [0, 1]}

The stored index is the first occurrence of the found duplicate, as in your example. If you want to store all the duplicate indexes, you can add lookup[v].append(i) after the lookup[v][1] += 1 line, but then the data might look weird (the structure would be [first_index, number_of_occurrences, second_index, third_index, ...]).

Instead, just flip the stored parameters in the lookup[v] modifications: use lookup[v] = [0, i] instead of lookup[v] = [i, 0] and lookup[v][0] += 1 instead of lookup[v][1] += 1. Then lookup[v].append(i) after it will give you a nice result in the form [number_of_occurrences, first_index, second_index, third_index, ...], as in the sketch below.
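
Put together, that variant would look something like this (identify_duplicates_all is just an illustrative name; the occurrence count here tallies repeats beyond the first, matching the original):

def identify_duplicates_all(data):
    lookup = {}  # maps each value to [number_of_occurrences, first_index, later_indexes...]
    result = {}
    for i, v in enumerate(data):
        if v in lookup:
            if v not in result:
                result[v] = lookup[v]  # result shares the list, so later appends show up
            lookup[v][0] += 1    # bump the occurrence count
            lookup[v].append(i)  # record this duplicate's index
        else:
            lookup[v] = [0, i]   # [number_of_occurrences, first_index]
    return result

print(identify_duplicates_all(doiList))
# prints: {'10.1016/j.ijnurstu.2017.05.011 [doi]': [1, 0, 1]}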

zwer