
I have a rather big text file, around 30GB on average. I want to remove duplicate lines from this file. What is a good, efficient algorithm for this? For small files, I usually use dictionaries, e.g. Python dictionaries, to store unique keys. But this time the file is rather big. Any language suggestion is fine (I am thinking of using C, or is it not so much language-dependent as the algorithm that matters?). Thanks.

dorothy

1 Answer


If you can't just fire up an instance on Amazon with enough memory to hold everything in RAM, this is the strategy I would use:

Step 1 - go through the file and generate a checksum/hash value for each line. I'd probably use SipHash. Output these to a file.
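A minimal sketch of this step in Python, assuming placeholder file names. The standard library has no SipHash, so an 8-byte BLAKE2b digest stands in for it here (any fast 64-bit hash works, since collisions get resolved in step 3 anyway):

    import hashlib

    def line_hash(line: bytes) -> str:
        # 8-byte BLAKE2b digest as a stand-in for SipHash, which is not
        # in the standard library; collisions are handled in step 3.
        return hashlib.blake2b(line, digest_size=8).hexdigest()

    # "input.txt" and "hashes.txt" are placeholder names.
    with open("input.txt", "rb") as src, open("hashes.txt", "w") as out:
        for line in src:
            out.write(line_hash(line) + "\n")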

Step 2 - sort the file of SipHash values, and throw away any that have only one entry. Output the result as a set of hash values and their number of matches.
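A sketch of this step, assuming the hash file (far smaller than the 30GB source) fits in memory; if it doesn't, an external sort such as GNU `sort hashes.txt | uniq -dc` does the same job on disk:

    from collections import Counter

    # Count how often each hash occurs, then keep only the duplicated
    # hashes together with their counts.
    counts = Counter()
    with open("hashes.txt") as f:
        for line in f:
            counts[line.strip()] += 1

    with open("dup_hashes.txt", "w") as out:
        for h, n in counts.items():
            if n > 1:
                out.write(f"{h} {n}\n")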

Step 3 - read through the file again, regenerating the hash value for each line. If it's a line whose hash has a match, hold onto it in memory. If there's another line already in memory with the same hash value, compare them to see if the lines themselves match. Output "match" if true. Once you've seen all N lines that share a hash value and none of them matched, go ahead and dispose of the record.
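A sketch of this step, reusing `line_hash` from step 1; the counts from step 2 are what let us drop a hash's record from memory once all of its lines have been seen:

    # Load the duplicated hashes and their counts from step 2.
    dup_counts = {}
    with open("dup_hashes.txt") as f:
        for h, n in (line.split() for line in f):
            dup_counts[h] = int(n)

    pending = {}  # hash -> lines seen so far with that hash
    with open("input.txt", "rb") as src:
        for line in src:
            h = line_hash(line)
            if h not in dup_counts:
                continue                 # unique line, nothing to do
            if line in pending.setdefault(h, []):
                print("match:", line)    # a true duplicate line
            else:
                pending[h].append(line)
            dup_counts[h] -= 1
            if dup_counts[h] == 0:       # all lines with this hash seen
                del pending[h]           # dispose of the record
                del dup_counts[h]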

This strategy depends on the duplicates being only a small fraction of the total number of lines. If that's not the case, then I would use some other strategy, such as divide and conquer.

woolstar