5

I have a huge text file that contains duplicate lines, about 150,000,000 lines in total. I'd like to find the most efficient way to read these lines in and eliminate the duplicates. Some of the approaches I'm considering are:

  1. Read the whole file in, do a list(set(lines)).
  2. Read 10k lines in at a time, do a list(set(lines)) on what I have, read another 10k lines into the list, do a list(set(lines)) again, and repeat (sketched below).

How would you approach this problem? Would any form of multiprocessing help?
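
For concreteness, the second approach might look roughly like this sketch (the file name and chunk size are placeholders, not code from the question):

    import itertools

    lines = []
    with open("input.txt") as f:
        while True:
            chunk = list(itertools.islice(f, 10000))  # read the next 10k lines
            if not chunk:
                break
            lines = list(set(lines + chunk))  # dedupe everything read so far

Note that this still ends up holding every unique line in memory at once, which is the concern the answers below address.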

  • Is the file sorted? Is there a fixed number of characters per line? How many characters are there approximately per line? – tommy.carstensen Apr 26 '15 at 17:02
  • You don't care about the current ordering of the lines? Because using set will likely mess them up. – Shashank Apr 26 '15 at 17:02
  • Both of your approaches point at the same thing; the `set()` in this case won't do any magic, since comparing strings costs `O(mn)`, and when you are talking about `150000000` lines this approach doesn't seem feasible, I guess. – ZdaR Apr 26 '15 at 17:04
  • @tommy.carstensen The file is not sorted. The number of characters per line varies, roughly 13-20 characters per line. –  Apr 26 '15 at 17:05
  • @Shashank No, I don't care about the ordering of the lines. –  Apr 26 '15 at 17:05
  • How large a percentage of your lines are duplicates? More than 50%? Less than 1%? I would divide and conquer, if a large percentage of the lines were duplicates. – tommy.carstensen Apr 26 '15 at 17:14
  • @joe-koberg left an answer to the same question [here](http://stackoverflow.com/questions/3452832/remove-duplicate-rows-from-a-large-file-in-python). The user log0 explains [here](http://stackoverflow.com/questions/22751000/split-large-text-filearound-50gb-into-multiple-files) how to split a large file into smaller chunks. – tommy.carstensen Apr 26 '15 at 17:28

3 Answers

7

Multiprocessing will not really help, because your bottleneck is memory. You will need to use hashes:

  1. Read a line.
  2. Calculate its hash (e.g. md5) and look it up in a set of all hashes encountered so far.
  3. Output the line if its hash was not found in the set, and add the hash to the set (a sketch of this loop follows).
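
A minimal sketch of that loop, assuming the input is a plain text file and using illustrative file names:

    import hashlib

    seen = set()
    with open("input.txt", "rb") as src, open("deduped.txt", "wb") as dst:
        for line in src:
            # store a fixed-size 16-byte md5 digest instead of the line itself
            h = hashlib.md5(line).digest()
            if h not in seen:
                seen.add(h)
                dst.write(line)

(As the comments below note, a hash collision would silently drop a distinct line; with md5 over 150 million short lines that is astronomically unlikely, but it is a trade-off to be aware of.)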

A couple of things to be mindful of:

  • md5 takes 128 bits (16 bytes), so for 150,000,000 lines that is roughly 2.4 GB of RAM even before any overhead.
  • Python's set and dict have a large per-entry memory overhead.

So if you have 4+ GB of RAM, it is doable. A more scalable solution would be to store the encountered hashes in sorted file(s) on disk and search through them every time. This will be (a lot!) slower, but you can keep the memory footprint as low as you want.

Also, if you don't care about line ordering in the resulting file, you can split your file into smaller files based on some hash function (lines whose md5 starts with a, lines whose md5 starts with b, etc.). That makes each piece small enough to just sort | uniq it (or sort in memory with Python, if you wish), after which you concatenate the results.
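
A rough sketch of that split-then-deduplicate idea, assuming the standard hashlib module and illustrative file names:

    import hashlib

    # Pass 1: partition lines into 16 bucket files by the first hex digit of their md5.
    buckets = {c: open(f"bucket_{c}.txt", "w") for c in "0123456789abcdef"}
    with open("input.txt") as src:
        for line in src:
            buckets[hashlib.md5(line.encode()).hexdigest()[0]].write(line)
    for f in buckets.values():
        f.close()

    # Pass 2: each bucket is roughly 1/16 of the data, so it can be deduplicated in memory.
    with open("deduped.txt", "w") as dst:
        for c in "0123456789abcdef":
            with open(f"bucket_{c}.txt") as f:
                dst.writelines(set(f))

If 1/16 of the data is still too big, bucket on the first two hex digits (256 buckets) instead.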

thule
  • The lines are 13-20 characters on average, so I guess there is no need to calculate md5 (128 bits)? – tommy.carstensen Apr 26 '15 at 17:09
  • Guess not) Or you can use a shorter hash, e.g. python's hash() function and save memory. Not sure how good it is though. – thule Apr 26 '15 at 17:11
  • @tommy.carstensen I guess there is, because the total is `150000000` lines and matching full strings over such a large data set would consume a lot of time; I guess this hashing approach is better than `set()`. – ZdaR Apr 26 '15 at 17:12
  • I like this hashing approach. The question is what to do about hash collisions. You can have two strings that are different but their hash is the same. I don't know how good the default hash function is, so it might still be better to use md5 as suggested by @letitbee. – Roman Kutlak Apr 26 '15 at 17:37
  • I think you can truncate sha1 and still get pretty good entropy. But with input set so large I don't know how good an idea it is. – thule Apr 26 '15 at 18:26
5

Memory is the problem here, so loading the entire file into memory may not be an option.

One potential option, since you don't need to maintain the ordering of the lines, is to do some sort of radix sorting:

    # A sketch of the partitioning pass; the input file name is illustrative.
    buckets = {}  # first character -> open bucket file
    for line in open("input.txt"):
        buckets.setdefault(line[0], open(f"bucket_{ord(line[0])}.txt", "w")).write(line)

The new files should now be a fair bit smaller, and you can recursively split the files based on the 2nd, 3rd, etc. characters in the case that some of the files are still too big (say for example, every line in your original file starts with an a).

Once these files are small enough to fit into memory, you can do your list(set(lines)) approach on each one and then cat the results back together when you're done. Or you can pipe each file through the sort and uniq UNIX command-line tools, if they're available to you (uniq on its own only removes adjacent duplicates, so sort first).

Note that the radix sort part can be easily parallelized, since each line is independent of the others.
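
Since each bucket can be processed independently, the per-bucket deduplication could also be parallelized along these lines (a sketch, assuming the bucket files produced by the split above; names are illustrative):

    import glob
    from multiprocessing import Pool

    def dedupe_bucket(path):
        # Each bucket is small enough to deduplicate in memory with a set.
        with open(path, "rb") as f:
            unique = set(f)
        with open(path + ".uniq", "wb") as out:
            out.writelines(unique)

    if __name__ == "__main__":
        with Pool() as pool:
            pool.map(dedupe_bucket, glob.glob("bucket_*.txt"))

The .uniq files can then be concatenated to form the final output.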

robbrit
0

Think about whether you really need to solve this in Python itself. You could:

  • call out to sort and uniq, standard tools that are present on most POSIX systems. They will do the job, are faster, and handle edge cases (e.g. running out of memory) that you would otherwise have to think about yourself.
  • The simplest solution would probably be to create an in-memory database using the sqlite3 module, insert all lines into a temporary table, and do a SELECT DISTINCT from it (see the sketch below). Again, sqlite will perform better than anything you could write yourself in pure Python.
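
A minimal sketch of the sqlite route, assuming Python's standard sqlite3 module and illustrative file names (point the connection at a file on disk instead of :memory: if the data won't fit in RAM):

    import sqlite3

    con = sqlite3.connect(":memory:")  # or a path on disk for a larger-than-RAM table
    con.execute("CREATE TABLE lines (line TEXT)")
    with open("input.txt") as src:
        con.executemany("INSERT INTO lines VALUES (?)", ((line,) for line in src))
    with open("deduped.txt", "w") as dst:
        for (line,) in con.execute("SELECT DISTINCT line FROM lines"):
            dst.write(line)
    con.close()
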
user2722968
  • No, sqlite will make it worse, not better. You could actually do it with a regular, disk-based database, but it is grossly inefficient, and also this is not solving a problem, it is letting someone else solve it. – thule Apr 26 '15 at 17:20