Imagine we have some file, called, for example, "A.txt". We know it contains some duplicate lines. "A.txt" is very big, more than ten times bigger than available memory, maybe around 50 GB. Sometimes the deduplicated result will be approximately the same size as A, and sometimes it will be many times smaller. Let "A.txt" have a structure like this:
a 1
b 2
c 445
a 1
We need to produce a file "B.txt" that does not contain such duplicates. For the example above, it should be:
a 1
b 2
c 445
I thought of an algorithm that copies A to B, then takes the first line of B and scans every other line for a match, deleting any duplicates it finds; then it takes the second line, and so on.
But that needs a full pass over the file for every line, so I think it is way too slow. What can I use instead?
A is not a database! No SQL, please.
Sorry, I forgot to say: sorting is OK.
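Since sorting is allowed, one standard approach is an external merge sort followed by a single unique pass: once the file is sorted, all duplicates are adjacent, so a streaming comparison with the previous line removes them. Below is a minimal Python sketch; the function name `dedupe_by_external_sort`, the chunk size, and the temp-file handling are my own illustrative choices, not anything from the question:

```python
import heapq
import itertools
import os
import tempfile

def dedupe_by_external_sort(src, dst, chunk_lines=1_000_000):
    # Phase 1: read memory-sized chunks, sort each one, and spill it
    # to a temporary sorted "run" file on disk.
    runs = []
    with open(src) as f:
        while True:
            chunk = list(itertools.islice(f, chunk_lines))
            if not chunk:
                break
            chunk.sort()
            tmp = tempfile.NamedTemporaryFile("w", delete=False)
            tmp.writelines(chunk)
            tmp.close()
            runs.append(tmp.name)

    # Phase 2: k-way merge of the sorted runs. After sorting,
    # duplicates are adjacent, so writing a line only when it differs
    # from the previous one removes them in one streaming pass.
    files = [open(r) for r in runs]
    with open(dst, "w") as out:
        prev = None
        for line in heapq.merge(*files):
            if line != prev:
                out.write(line)
                prev = line
    for f in files:
        f.close()
    for r in runs:
        os.remove(r)
```

This only ever holds one chunk (plus one line per run) in memory, so it works no matter how much bigger A is than RAM; in practice `sort -u A.txt > B.txt` on Unix does exactly this kind of external sort for you.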
Even though it can be sorted here, what about the case where the data cannot be sorted?
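If the lines cannot be meaningfully ordered, duplicates can still be grouped together by hashing: partition the file into bucket files by a hash of each line, so every copy of a line necessarily lands in the same bucket, then deduplicate each (much smaller) bucket in memory. Here is a sketch of that idea; the name `dedupe_by_hash_partition` and the bucket count are assumptions, and it assumes each bucket (roughly input size divided by `num_buckets`) fits in memory:

```python
import hashlib
import os
import tempfile

def dedupe_by_hash_partition(src, dst, num_buckets=64):
    # Pass 1: route every line to a bucket chosen by a hash of its
    # content, so all copies of the same line end up in one bucket file.
    buckets = [tempfile.NamedTemporaryFile("w+", delete=False)
               for _ in range(num_buckets)]
    with open(src) as f:
        for line in f:
            i = int(hashlib.md5(line.encode()).hexdigest(), 16) % num_buckets
            buckets[i].write(line)

    # Pass 2: each bucket is small enough to deduplicate in memory
    # with a set; concatenate the cleaned buckets into the output.
    with open(dst, "w") as out:
        for b in buckets:
            b.seek(0)
            seen = set()
            for line in b:
                if line not in seen:
                    seen.add(line)
                    out.write(line)
            b.close()
            os.remove(b.name)
```

Note that the output order is not the input order (lines come out grouped by bucket); if 50 GB divided by 64 buckets is still too big for memory, just raise `num_buckets`, or recursively partition any bucket that is still too large.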