4

Imagine we have a file, call it "A.txt", that contains some duplicate lines. "A.txt" is very big, more than ten times bigger than memory, maybe around 50 GB. Sometimes the deduplicated output (call it B) will be approximately the same size as A, and sometimes it will be many times smaller. Let A have a structure like this:

a 1
b 2
c 445
a 1

We need to get file "B.txt", that will not have such duplicates. As example, it should be this:

a 1
b 2
c 445

My idea was an algorithm that copies A to B, then takes the first line of B and compares it against every other line, deleting any duplicates it finds; then it takes the second line, and so on.

But I think it is way too slow. What can I use?

A is not a database! No SQL, please.

Sorry, I forgot to say: sorting is OK.

Although in my case it can be sorted, what if it could not be sorted?

Brian Tompsett - 汤莱恩
Monory
    I think CouchDB is a NoSQL solution. – Kerrek SB Nov 03 '11 at 13:30
  • You said that 'A.txt' doesn't fit in memory. Do you know if the unique elements of A (ie, the result in B) will fit in memory? I assume they will not but if they do it simplifies the problem greatly. – Kane Nov 03 '11 at 13:52
  • B will not fit in memory either. – Monory Nov 03 '11 at 14:00
  • Do you have Microsoft Access? This could easily be done in Access even with your txt files. – asus3000 Nov 03 '11 at 14:10
  • I need an algorithm, not an implementation. – Monory Nov 03 '11 at 14:46
  • @Monory - Why, is this homework? – mbeckish Nov 03 '11 at 14:49
  • @mbeckish, no. I was asked this question once by an employer; I gave my first idea and he said it was OK but not good. I just want to learn better algorithms than mine, and to help someone else too :D – Monory Nov 03 '11 at 15:02
  • @Monory - Ok. It's just that a DB is a good fit for this, because the DBMS designers have already done all the hard work involved with sorting / searching / filtering large data sets. Seems counter-productive to implement your own less robust subset of that functionality. – mbeckish Nov 03 '11 at 16:02
  • @mbeckish I know a database is a good fit for my problem; I have used Access before. But what if I need this on a server that runs Linux? Also, the developers of Access used some algorithm themselves, and that algorithm is what I am interested in. – Monory Nov 04 '11 at 17:02
  • @Monory - I wouldn't recommend Access, but any number of DBMS would suffice. If you're interested in how they are implemented, then I'm sure a little Googling would turn up lots of resources. – mbeckish Nov 07 '11 at 20:53

3 Answers

6

One solution would be to sort the file, then copy one line at a time to a new file, filtering out consecutive duplicates.
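The filtering half of that is a single streaming pass; here is a minimal sketch in Python, assuming the input has already been sorted so that duplicate lines are adjacent (file names are placeholders):

```python
def drop_consecutive_duplicates(sorted_path, out_path):
    # Assumes sorted_path is already sorted, so equal lines are adjacent;
    # keep the first line of each run of identical lines and skip the rest.
    prev = None
    with open(sorted_path) as inp, open(out_path, "w") as out:
        for line in inp:
            if line != prev:
                out.write(line)
                prev = line
```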

Then the question becomes: how do you sort a file that is too big to fit in memory?

Here's how Unix sort does it.
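In outline (this is a generic external merge-sort sketch, not GNU sort's actual code, with an arbitrary chunk size): sort memory-sized chunks, spill each sorted run to a temporary file, then stream a k-way merge of the runs.

```python
import heapq
import itertools
import tempfile

def external_sort(src_path, dst_path, max_lines_in_memory=1_000_000):
    # Phase 1: sort chunks that fit in memory and spill each to a temp file.
    runs = []
    with open(src_path) as src:
        while True:
            chunk = list(itertools.islice(src, max_lines_in_memory))
            if not chunk:
                break
            chunk.sort()
            tmp = tempfile.TemporaryFile("w+")
            tmp.writelines(chunk)
            tmp.seek(0)
            runs.append(tmp)
    # Phase 2: lazily merge the sorted runs into the output in one pass.
    with open(dst_path, "w") as dst:
        dst.writelines(heapq.merge(*runs))
    for tmp in runs:
        tmp.close()
```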

See also this question.

mbeckish
  • But then, the file is not in the original order. However, no -1, because that might not be bad. OP should tell. – Rok Kralj Nov 03 '11 at 13:37
    You could decorate the file with line numbers, sort it on the key(s) you want to de-dup, de-dup, then re-sort on (and strip) line numbers. – Vatine Nov 03 '11 at 15:08
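Vatine's decorate/sort/de-dup/undecorate suggestion could look roughly like the sketch below, which leans on GNU sort for the two out-of-core sorting steps; the intermediate file names are placeholders.

```python
import subprocess

def dedup_keep_original_order(src_path, dst_path):
    # 1. Decorate: prefix every line with its original line number.
    with open(src_path) as src, open("decorated.txt", "w") as out:
        for i, line in enumerate(src):
            out.write(f"{i}\t{line}")
    # 2. Sort by content (field 2 onward), breaking ties by line number,
    #    so duplicates become adjacent with the first occurrence first.
    subprocess.run(["sort", "-t", "\t", "-k2", "-k1,1n",
                    "-o", "by_content.txt", "decorated.txt"], check=True)
    # 3. De-dup: keep only the first line of each run of equal content.
    with open("by_content.txt") as src, open("unique.txt", "w") as out:
        prev = None
        for line in src:
            content = line.split("\t", 1)[1]
            if content != prev:
                out.write(line)
                prev = content
    # 4. Re-sort by original line number and strip the decoration.
    subprocess.run(["sort", "-t", "\t", "-k1,1n",
                    "-o", "by_lineno.txt", "unique.txt"], check=True)
    with open("by_lineno.txt") as src, open(dst_path, "w") as out:
        for line in src:
            out.write(line.split("\t", 1)[1])
```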
4

Suppose you can fit 1/k'th of the file into memory and still have room for working data structures. The whole file can be processed in k or fewer passes, as below, and this has a chance of being much faster than sorting the whole file depending on line lengths and sort-algorithm constants. Sorting averages O(n ln n) and the process below is O(k n) worst case. For example, if lines average 10 characters and there are n = 5G lines, ln(n) ~ 22.3. In addition, if your output file B is much smaller than the input file A, the process probably will take only one or two passes.
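As a quick check of that arithmetic (my own back-of-envelope, not part of the original answer):

```python
import math
n = 50e9 / 10        # ~50 GB of ~10-character lines => roughly 5e9 lines
print(math.log(n))   # ~22.3: the ln(n) factor a comparison sort pays,
                     # versus at most k passes for the process below
```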

Process:

  1. Allocate a few megabytes for input buffer I, a few gigabytes for a result buffer R, and a gigabyte or so for a hash table H. Open input file F and output file O.
  2. Repeat: Fill I from F and process it into R, via step 3.
  3. For each line L in I, check if L is already in H and R. If so, go on to next L, else add L to R and its hash to H.
  4. When R is full, say with M entries, write it to O. Then keep filling I from F, dedup each line against H and R as in step 3, and write the surviving lines straight to O (without adding them to the now-full R). At EOF(F), go to step 5.
  5. Repeat, using the old O as the new input F and a fresh O for output: copy the first M (already deduplicated) lines from F to O unchanged, then refill R and H from the following lines as in steps 2 and 3, and copy the rest to O with dedup as before until EOF(F). At the end of each pass, set M to the number of deduplicated lines now at the beginning of the new O file.

Note that after each pass, the first M lines of O contain no duplicates, and none of those M lines are duplicated in the rest of O. Thus, at least 1/k'th of the original file is processed per pass, so processing takes at most k passes.

Update 1 Instead of repeatedly writing out and reading back in the already-processed leading lines, a separate output file P should be used, to which process buffer R is appended at the end of each pass. This cuts the amount of reading and writing by a factor of k/2 when result file B is nearly as large as A, or by somewhat less when B is much smaller than A; but in no case does it increase the amount of I/O.
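A minimal sketch of the process above in Python, using the Update 1 variant: each pass appends its result buffer to the final output and spills still-unprocessed lines to a temporary file that becomes the next pass's input. A Python set plays the role of both R and H here, the buffer is capped by line count rather than bytes purely for brevity, and the file names and cap are illustrative, not from the answer.

```python
import os

def dedup_multipass(input_path, output_path, max_buffer_lines):
    # Each pass: keep up to max_buffer_lines unique lines in memory (the
    # result buffer R; the Python set doubles as the hash table H), append
    # them to the final output, drop lines that duplicate the buffer, and
    # spill everything else to a temp file that the next pass will read.
    current = input_path
    pass_no = 0
    with open(output_path, "w") as out:
        while True:
            pass_no += 1
            buffer = set()
            spill_path = f"{output_path}.pass{pass_no}.tmp"
            spilled = 0
            with open(current) as src, open(spill_path, "w") as spill:
                for line in src:
                    if line in buffer:
                        continue                  # duplicate of a kept line
                    if len(buffer) < max_buffer_lines:
                        buffer.add(line)          # new unique line: keep it
                        out.write(line)
                    else:
                        spill.write(line)         # defer to a later pass
                        spilled += 1
            if current != input_path:
                os.remove(current)                # previous pass's temp file
            current = spill_path
            if spilled == 0:                      # nothing left to process
                os.remove(spill_path)
                break
```

Lines spilled in one pass can still duplicate each other, but they never duplicate anything already written to the output, which is why at most k passes are needed.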

James Waldby - jwpat7
2

You will essentially have to build up a searchable result set (if this language reminds you of database technology, that is no accident, no matter how much you dislike the fact that databases deal with exactly the same questions you are facing).

Two suitable, efficient data structures for this are a sorted range (implementable as a tree of some sort) and a hash table. As you process your file, you insert each record into the result set, and at that point you can check efficiently whether the record already exists. When you're done, you will have a reduced set of unique records.

Rather than duplicating the actual record, your result set could also store a reference of some sort to any one of the original records. It depends on whether the records are large enough to make that a more efficient solution.
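For instance, here is a sketch under the assumption that a small fingerprint per unique record fits in memory even though the full records do not (the 16-byte digest size is my choice, not the answer's):

```python
import hashlib

def dedup_with_fingerprints(src_path, dst_path):
    # The "searchable result set" here is a hash set of 16-byte digests:
    # one fingerprint per kept record instead of the record itself, which
    # keeps memory low at the cost of a vanishingly small collision risk.
    seen = set()
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for line in src:
            fp = hashlib.blake2b(line, digest_size=16).digest()
            if fp not in seen:
                seen.add(fp)
                dst.write(line)
```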

Or you could simply add a mark to the original data whether or not the record is to be included.

(Also consider using an efficient storage format like NetCDF for binary data, as a textual representation is far far slower to access and process.)

Kerrek SB