I have a rather big text file, about 30GB on average. I want to remove duplicate lines from this file. What is a good, efficient algorithm to do this? For small files, I usually use dictionaries, e.g. Python dictionaries, to store unique keys. But this time the file is rather big. Any language suggestion is fine (I am thinking of using C? Or is it not so much language-dependent, with the algorithm being what matters?). Thanks.
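For small files, my usual approach is basically this (just a sketch; filenames are placeholders, and a set here plays the role of a dictionary of unique keys):

```python
# Rough sketch of the small-file approach: keep every line seen so far in
# memory and skip the ones already present. Only works while the unique
# lines fit in RAM.
seen = set()
with open("input.txt") as src, open("deduped.txt", "w") as out:
    for line in src:
        if line not in seen:
            seen.add(line)
            out.write(line)
```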
-
Use the `uniq` command – kyle k Nov 27 '13 at 23:59
-
`sort $file | uniq` if you don't care about order; `uniq $file` if duplicates are guaranteed to be successive. – Vector Gorgoth Nov 28 '13 at 00:00
-
@VectorGorgoth How about just `sort -u $file`? – Macattack Nov 28 '13 at 00:02
-
@Macattack I always forget that flag for some reason. – Vector Gorgoth Nov 28 '13 at 00:04
-
Yeah, for me the best way is `sort -u $file` – ederwander Nov 28 '13 at 00:05
-
Is there any good reason to expect `sort -u` to be efficient or even not to run out of memory with such a large file? – R.. GitHub STOP HELPING ICE Nov 28 '13 at 00:23
-
@R.. it's an implementation detail, but GNU sort handles it through temp files. – that other guy Nov 28 '13 at 01:51
-
@R.. Perhaps that has already been answered here: http://stackoverflow.com/questions/930044/how-could-the-unix-sort-command-sort-a-very-large-file (yes, it will handle large files) – Macattack Nov 28 '13 at 01:51
-
do you mean adjacent duplicate lines, or any lines? how many lines are there (or what is the average line length)? – ysth Nov 28 '13 at 02:33
-
Is this something you're only going to have to do once, or is this a recurring job? (ie. how important is it to optimize programmer time vs. running time?) – Joe Z Nov 28 '13 at 03:07
-
Hi, the duplicate lines can be anywhere, not just adjacent, and it's a recurring job. Thanks. – dorothy Nov 28 '13 at 05:41
1 Answer
If you can't just fire up an instance on Amazon with enough memory to hold everything in RAM, this is the strategy I would use:
Step 1 - go through the file and generate a checksum/hash value for each line. I'd probably use SipHash. Output these to a file.
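Roughly, pass 1 might look like this in Python (filenames are placeholders; `hashlib.blake2b` with an 8-byte digest is used here as a stand-in for SipHash, since the standard library doesn't ship SipHash directly):

```python
import hashlib

# Pass 1: write one fixed-width hash per input line, in input order.
with open("input.txt", "rb") as src, open("hashes.txt", "w") as out:
    for line in src:
        out.write(hashlib.blake2b(line, digest_size=8).hexdigest() + "\n")
```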
Step 2 - sort the file of SipHash values and throw away any that only have one entry. Output the result as a set of hash values and their number of matches.
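One way to sketch pass 2 is to shell out to GNU `sort` (which spills to temp files for inputs bigger than RAM) and then scan for runs of equal hashes, keeping only the duplicated ones with their counts (again, filenames are placeholders):

```python
import subprocess

# Pass 2: external sort of the hash file, then keep only hashes that occur
# more than once, writing "<hash> <count>" per line.
subprocess.run(["sort", "-o", "hashes.sorted.txt", "hashes.txt"], check=True)

with open("hashes.sorted.txt") as f, open("dup_hashes.txt", "w") as out:
    prev, count = None, 0
    for h in f:
        if h == prev:
            count += 1
        else:
            if count > 1:
                out.write(f"{prev.strip()} {count}\n")
            prev, count = h, 1
    if count > 1:
        out.write(f"{prev.strip()} {count}\n")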
Step 3 - read through the file again, regenerating the hash value for each line. If a line's hash is one that has a match, hold onto the line in memory. If there's another line already in memory with the same hash value, compare them to see whether the lines themselves match; output "match" if they do. Once you've seen all N lines that share a hash value and none of them matched, go ahead and dispose of the record.
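A rough sketch of pass 3, again with placeholder filenames; this variant writes a de-duplicated copy of the input rather than just printing "match", and for brevity it skips the count-based cleanup of records that can no longer match:

```python
import hashlib
from collections import defaultdict

# Load the hashes that occurred more than once (small if duplicates are rare).
dup_hashes = set()
with open("dup_hashes.txt") as f:
    for entry in f:
        dup_hashes.add(entry.split()[0])

# Pass 3: rehash every line; lines with a unique hash go straight to the
# output, lines with a duplicated hash are compared against the actual
# lines already seen for that hash (this also rules out hash collisions).
seen = defaultdict(list)
with open("input.txt", "rb") as src, open("deduped.txt", "wb") as out:
    for line in src:
        h = hashlib.blake2b(line, digest_size=8).hexdigest()
        if h not in dup_hashes:
            out.write(line)
        elif line not in seen[h]:
            seen[h].append(line)
            out.write(line)
```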
This strategy depends on the duplicates being only a small fraction of the total number of lines. If that's not the case, then I would use some other strategy, such as divide and conquer.
