
As the title says, I want to find and delete duplicate lines in a file. That is pretty easy to do... the catch is that I want to know the fastest and most efficient way to do it (say you have gigabytes' worth of files and you want to do this as efficiently and as quickly as you can).

If you know a method that can do this, however complicated, I would like to hear it. I've heard about things like loop unrolling and have started to second-guess whether the simplest approach is also the fastest, so I am curious.

  • Possible duplicate of [How might I remove duplicate lines from a file?](http://stackoverflow.com/questions/1215208/how-might-i-remove-duplicate-lines-from-a-file) – Yevhen Kuzmovych Nov 24 '16 at 15:45
  • Check this as well: http://stackoverflow.com/questions/1444406/how-can-i-delete-duplicate-lines-in-a-file-in-unix – Mohammad Yusuf Nov 24 '16 at 15:46
  • I don't see any of the answers in the threads you linked deal with matters of performance. – sobek Nov 24 '16 at 15:48
    Please [edit] your question to show [what you have tried so far](http://whathaveyoutried.com). You should include a [mcve] of the code that you are having problems with, then we can try to help with the specific problem. You should also read [ask]. – Toby Speight Nov 24 '16 at 15:56
  • if doing it with python is not a strict requirement then `uniq file_with_dupes > file_without_dupes` is perhaps the easiest and fastest way to do it. – sid-m Nov 24 '16 at 18:00

2 Answers


The best solution is to keep a set of the lines seen so far and return only the ones not already in it. This is the same approach used by the `unique_everseen` recipe in Python's itertools documentation.

def unique_lines(filename):
    # Read every line into memory, then yield each distinct line
    # in the order it first appears.
    lines = open(filename).readlines()
    seen = set()

    for line in lines:
        if line not in seen:
            yield line
            seen.add(line)

and then

for unique_line in unique_lines(filename):
    # do stuff

Of course, if you don't care about the order, you can convert all the lines to a set directly:

set(open(filename).readlines())
– blue_note (edited by Jonathan Eunice)
  • Wouldn't `open(filename).readlines()` create a list in memory with all the lines? It doesn't seem memory efficient. You can do this instead: `f = open(filename); for line in f: ...`. Also, it's best practice to use a `with` statement when dealing with files. – Roy Cohen Jun 06 '21 at 21:09
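Following that comment, here is a more memory-friendly sketch of the same generator (my own illustration, not part of the original answer): it iterates over the file lazily inside a `with` block instead of calling `readlines()`, so only the set of distinct lines is held in memory.

def unique_lines(filename):
    # Yield each distinct line of filename once, in order of first appearance.
    seen = set()
    with open(filename) as f:    # file is closed automatically
        for line in f:           # lazy iteration: one line read at a time
            if line not in seen:
                seen.add(line)
                yield line

Memory use then grows with the number of unique lines rather than with the total file size.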

Use Python's hashlib to hash every line in the file to a fixed-size digest. To check whether a line is a duplicate, look its hash up in a set of the hashes seen so far.

Lines can be kept directly in a set; however, hashing them first reduces the space required.

https://docs.python.org/3/library/hashlib.html
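
A minimal sketch of this idea (my own illustration, not code from the answer; the function name `write_unique_lines` is hypothetical) stores the MD5 digest of each line instead of the line itself, so each distinct line costs a fixed 16 bytes in the set:

import hashlib

def write_unique_lines(in_path, out_path):
    # Copy in_path to out_path, skipping lines whose hash was already seen.
    seen = set()
    with open(in_path, "rb") as src, open(out_path, "wb") as dst:
        for line in src:
            digest = hashlib.md5(line).digest()   # 16-byte digest per line
            if digest not in seen:
                seen.add(digest)
                dst.write(line)

Note that hash collisions are theoretically possible with this scheme (two different lines sharing a digest would cause one to be dropped), though with MD5 the chance is negligible for any realistic file.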

– SaiNageswar S