
This is what I do to find all duplicate lines in a text file:

import regex  # third-party 'regex' module, used here like the built-in re

# capture all lines in a buffer (f is the open input file)
r = f.readlines()
# create a list of all line numbers
lines = list(range(1, len(r) + 1))
# merge both lists into [line, line_number] pairs
z = [list(a) for a in zip(r, lines)]

# sort by line content
newsorting = sorted(z)

# put duplicates in a list
listdoubles = []
for i in range(len(newsorting) - 1):
    if (newsorting[i][0] == newsorting[i + 1][0]) and (not regex.search(r'^\s*$', newsorting[i][0])):
        listdoubles.append(newsorting[i][1])
        listdoubles.append(newsorting[i + 1][1])

# remove eventual duplicate line numbers
listdoubles = list(set(listdoubles))
# sort the line numbers numerically
listdoubles = sorted(listdoubles)
print(listdoubles)

But it is very slow. When I have more than 10,000 lines it takes about 10 seconds to create this list.

Is there a way to do it faster?

Reman

1 Answer


You can use a simpler approach:

  • for each line
  • if it has been seen before then display it
  • else add it to the set of known lines

In code:

seen = set()
for L in f:  # f is the open input file
    if L in seen:
        print(L)
    else:
        seen.add(L)

If you want to display the line numbers where duplicates appear, the code can be changed to use a dictionary mapping each line's content to the line number where that text was first seen:

seen = {}
for n, L in enumerate(f, 1):  # count lines from 1
    if L in seen:
        print("Line %i is a duplicate of line %i" % (n, seen[L]))
    else:
        seen[L] = n

Both dict and set in Python are hash-based and provide average constant-time lookup operations.
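To see what that constant-time lookup buys you, here is a minimal micro-benchmark comparing membership tests on a list versus a set; the data and sizes below are made up for illustration and are not from the original post:

```python
import timeit

# Illustrative data: 10,000 distinct lines
data = [str(i) for i in range(10_000)]
as_set = set(data)

# Looking up an element near the end of the list forces a linear scan,
# while the set resolves it with a single hash lookup.
t_list = timeit.timeit(lambda: "9999" in data, number=1_000)
t_set = timeit.timeit(lambda: "9999" in as_set, number=1_000)
print("list: %.4fs  set: %.4fs" % (t_list, t_set))
```

On any reasonable machine the set lookup should come out orders of magnitude faster, which is why the growing `seen` structure above stays fast even for large files.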

EDIT

If you only need the line numbers of the last duplicate of each line, then the output clearly cannot be produced during processing; you first have to process the whole input before emitting any output:

# lastdup will be a map from line content to the line number the
# last duplicate was found. On first insertion the value is None
# to mark the line is not a duplicate
lastdup = {}
for n, L in enumerate(f, 1):  # count lines from 1
    if L in lastdup:
        lastdup[L] = n
    else:
        lastdup[L] = None

# Now all values that are not None are the last duplicate of a line
result = sorted(x for x in lastdup.values() if x is not None)
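As a quick sanity check, the same logic can be run on a small in-memory sample instead of a file; the sample lines here are made up for illustration, and line numbers count from 1:

```python
sample = ["a\n", "b\n", "a\n", "c\n", "b\n", "a\n"]  # hypothetical file content

lastdup = {}
for n, L in enumerate(sample, 1):
    if L in lastdup:
        lastdup[L] = n     # remember the most recent duplicate
    else:
        lastdup[L] = None  # first occurrence: not (yet) a duplicate

result = sorted(x for x in lastdup.values() if x is not None)
print(result)  # [5, 6]: last duplicate of "b" is line 5, of "a" is line 6
```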
6502
  • It's worth mentioning that identity checking is O(1) for a set, compared to an ever growing list, or some other method. – Reti43 Mar 26 '16 at 08:33
  • So this is just as fast as using a dict? – Bemmu Mar 26 '16 at 08:34
  • 1
    @Bemmu Yep, both implement hashtables. In fact, I consider sets to be like dicts with only keys and value set to None. https://wiki.python.org/moin/TimeComplexity. Additional reading [here](http://stackoverflow.com/questions/513882/python-list-vs-dict-for-look-up-table). – Reti43 Mar 26 '16 at 08:35
  • @6502, Such a simple solution! Thank you. Just one thing: I would like to have the line numbers of the duplicates, not the lines themselves. Linenumbers=[[linenr,linenrdouble],[linenr,linenrdouble],etc] Can I use your code to obtain the line numbers? – Reman Mar 26 '16 at 08:43
  • 1
    @Reman: see edited version for code that display line numbers for duplicates. – 6502 Mar 26 '16 at 08:51
  • @6502, it is much faster than my solution :) I've seen that your solution also reports duplicate empty lines. Is there a way to leave them out? – Reman Mar 26 '16 at 10:34
  • @6502, in my original script I also have a function to add only the last duplicate line to a list; it is a bit difficult your way, but I'll keep testing. :) – Reman Mar 26 '16 at 10:43
  • @6502 Hi, Happy Easter. :) I'd like to ask you one more thing, if I may. I'm also trying to get the numbers of all duplicate lines including the originals, but not the last duplicate. E.g. if lines 3, 5, 8 are duplicates, I'd like to get only the number 8. How can I do that? – Reman Mar 27 '16 at 10:48
  • @6502 You're going to hate me.. I don't need line 8, but lines 3 and 5. The reason is that I want to remove all the duplicates except the last one found. – Reman Mar 27 '16 at 13:46
  • Found it. Thanks anyway. I learned new things. – Reman Mar 29 '16 at 08:12