22

I have a file (~1.9 GB in size) that contains ~220,000,000 (~220 million) words/strings. The words contain duplicates, roughly one duplicate for every 100 words.

In my second program, I want to read the file. I can read it line by line using a BufferedReader.

Now, to remove duplicates we can use a Set (and its implementations), but a Set runs into problems, as described in the following three scenarios:

  1. With the default JVM heap size, the Set can hold up to 0.7-0.8 million words before an OutOfMemoryError.
  2. With a 512M JVM heap, the Set can hold up to 5-6 million words before an OOM error.
  3. With a 1024M JVM heap, the Set can hold up to 12-13 million words before an OOM error. After about 10 million records have been added to the Set, operations become extremely slow; for example, adding the next ~4,000 records took 60 seconds.

I am restricted from increasing the JVM heap size any further, and I want to remove the duplicate words from the file.

Please let me know if you have any ideas about other ways/approaches to remove duplicate words from such a gigantic file using Java. Many thanks :)

Additional info: my words are basically alphanumeric IDs that are unique in our system, so they are not plain English words.

Ketan
  • 738
  • 2
  • 9
  • 20
  • for the solution, could you use a database or even a second file to store result? – Francisco Spaeth Sep 19 '12 at 19:04
  • I think you're going to iterate a long time. – Hot Licks Sep 19 '12 at 19:08
  • I would make sure I have enough memory for the task. You can buy 16 GB PC memory for about $100. It doesn't cost that much these days. – Peter Lawrey Sep 19 '12 at 19:31
  • 3
    What do you mean with "one duplicate word every 100 words"? Every sublist of length 100 contains on average one element twice? Or 99% of the words are unique in the file? – meriton Sep 19 '12 at 20:36
  • 5
    [External Sorting](http://en.wikipedia.org/wiki/External_sorting) is the key here. After the words are sorted, all duplicates will be consecutive. Then you make one more quick pass over the data, filtering out duplicates. – NovaDenizen Sep 19 '12 at 20:48
  • 1
    Define "word" here. Is it really a standard spoken language word, or something else? – Hot Licks Sep 19 '12 at 22:55
  • @Peter Lawrey: In theory, I'd definitely agree with you. However this can depend on your organization. I've been in contact with places where hardware hosting is outsourced where a $100 PC wouldn't be allowed near a data center. You'd need to buy an expensive server, that also costs a small fortune each month just to sit there running, sorting a somewhat large file of strings every now and then. – Buhb Sep 20 '12 at 04:27
  • Please keep in mind that the file itself is larger than 1024M, so any attempt to store all strings in memory without some sophisticated compression scheme is guaranteed to fail. – Buhb Sep 20 '12 at 04:34
  • @Buhb While that is true, organisations with a data centre which spend say $3K on a PC won't be paying you minimum wage. AFAIK, The cost of your time to the company compared with the cost of the hardware+support tends to be similar whether the organisation is lean or not. – Peter Lawrey Sep 20 '12 at 07:11
  • Is retaining the order of words important? Do you have to use Java or would something like `sort -u` be possible? – Axel Sep 20 '12 at 14:02
  • As to your statements about how many words can be held in memory, have you made sure your words do not consume more memory than needed? I mean things like storing substrings without explicitly constructing new String instances (as this will share the original strings' buffer thus increasing memory usage). However, with ~220mio words, this won't solve your problem... – Axel Sep 20 '12 at 14:09
  • How many characters in one of your "words". Have you computed how many distinct words might be formed? – Hot Licks Sep 20 '12 at 21:24

13 Answers

14

Use merge sort and remove the duplicates in a second pass. You could even remove the duplicates while merging (just keep the most recent word written to the output in RAM and compare the candidates against it).
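
A minimal sketch of that merge step with deduplication folded in, assuming the input has already been split into sorted, word-per-line chunk files (file names and the two-way merge are illustrative; a real external sort would typically merge k chunks at once):

import java.io.*;

// Merge two sorted word-per-line files, dropping duplicates on the way out.
public class MergeDedup {
    public static void main(String[] args) throws IOException {
        try (BufferedReader a = new BufferedReader(new FileReader("chunk1.txt"));
             BufferedReader b = new BufferedReader(new FileReader("chunk2.txt"));
             BufferedWriter out = new BufferedWriter(new FileWriter("merged.txt"))) {
            String wa = a.readLine(), wb = b.readLine(), last = null;
            while (wa != null || wb != null) {
                String next;
                if (wb == null || (wa != null && wa.compareTo(wb) <= 0)) {
                    next = wa; wa = a.readLine();
                } else {
                    next = wb; wb = b.readLine();
                }
                if (!next.equals(last)) {   // keep only the first word of each run
                    out.write(next);
                    out.newLine();
                    last = next;
                }
            }
        }
    }
}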

Tobias Ritzau
  • 3,327
  • 2
  • 18
  • 29
11

Divide the huge file into 26 smaller files based on the first letter of the word. If any of the letter files are still too large, divide that letter file by using the second letter.

Process each of the letter files separately using a Set to remove duplicates.
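
A rough sketch of the partitioning pass (the file names, the one-word-per-line layout, and the lower-casing are assumptions):

import java.io.*;
import java.util.HashMap;
import java.util.Map;

// Split the big file into one bucket file per leading character, so that all
// copies of a given word land in the same, much smaller bucket.
public class SplitByFirstLetter {
    public static void main(String[] args) throws IOException {
        Map<Character, BufferedWriter> buckets = new HashMap<Character, BufferedWriter>();
        try (BufferedReader in = new BufferedReader(new FileReader("words.txt"))) {
            String word;
            while ((word = in.readLine()) != null) {
                if (word.isEmpty()) continue;
                char c = Character.toLowerCase(word.charAt(0));
                BufferedWriter w = buckets.get(c);
                if (w == null) {
                    w = new BufferedWriter(new FileWriter("bucket_" + c + ".txt"));
                    buckets.put(c, w);
                }
                w.write(word);
                w.newLine();
            }
        }
        for (BufferedWriter w : buckets.values()) w.close();
    }
}

Each bucket file can then be deduplicated independently with an ordinary HashSet, since no word can appear in two different buckets.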

Gilbert Le Blanc
  • 50,182
  • 6
  • 67
  • 111
  • 1
    That would assume that `Q` is as frequent as `A`, or you may go over the 10M words that fit for some letters. – Joachim Isaksson Sep 19 '12 at 19:12
  • @Joachim Isaksson: Fine. Break up the largest files by the first two letters. – Gilbert Le Blanc Sep 19 '12 at 19:17
  • 3
    I find this solution both more complicated to explain and more complicated to implement than the straightforward sort-based solutions given by others. Sorting big files on disk is a common task with ready-made implementations. The whole "subdivide the larger files if they are still too large" begs for either more code or manual intervention. It's really just much simpler to go ahead and sort the whole thing and be done with it. – John Y Sep 20 '12 at 01:48
  • The implication John Y mentioned aside, you can divide your file based on hashcode() % n for a reasonable number n. – Buhb Sep 20 '12 at 04:22
  • @JohnY & Buhb: My method guarantees that there will be no duplicate words across multiple files. All of any individual duplicate word will be within one file. – Gilbert Le Blanc Sep 20 '12 at 13:13
  • @GilbertLeBlanc: But the OP starts with **one** gigantic file. If we sort **that one** file, then we don't even introduce the notion of "duplicates across multiple files". – John Y Sep 20 '12 at 14:06
  • @GilbertLeBlanc I'm merely addressing the problem that occurs if 99% of the words starts with the same letter. Since your edit that suggests recursive subdivision of files this is no longer an issue. – Buhb Sep 21 '12 at 06:50
7

You might be able to use a trie data structure to do the job in one pass. It has advantages that recommend it for this type of problem. Lookup and insert are quick. And its representation is relatively space efficient. You might be able to represent all of your words in RAM.
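
A bare-bones sketch of the idea for alphanumeric IDs over a 36-symbol alphabet (the alphabet and the lower-casing are assumptions; insert() doubles as a "seen before?" test):

// Minimal trie over [0-9a-z]; shared prefixes are stored only once.
class Trie {
    private static class Node {
        Node[] children = new Node[36];
        boolean isWord;
    }

    private final Node root = new Node();

    /** Returns true if the word was new, false if it was already present. */
    boolean insert(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            int idx = index(Character.toLowerCase(word.charAt(i)));
            if (node.children[idx] == null) node.children[idx] = new Node();
            node = node.children[idx];
        }
        if (node.isWord) return false;
        node.isWord = true;
        return true;
    }

    private static int index(char c) {
        return c <= '9' ? c - '0' : 10 + (c - 'a');
    }
}

Whether the whole set fits in the available heap depends heavily on how much prefix sharing the IDs actually have.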

gregg
  • 1,027
  • 5
  • 6
  • This is one of the most interesting suggestions so far. You might run out of RAM and then you need to look at an entirely new solution, but this at least offers some hope of storing all unique Strings in memory, which is convenient. – Buhb Sep 21 '12 at 07:11
  • you still need more than one node per distinct word - at least 8 bytes each, even if you do not store the strings themselves, plus an array of links per node – Konstantin Pribluda Oct 09 '12 at 08:41
5

If you sort the items, duplicates will be easy to detect and remove, as the duplicates will bunch together.

There is code here you could use to mergesort the large file: http://www.codeodor.com/index.cfm/2007/5/10/Sorting-really-BIG-files/1194
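
Once the file is sorted, the cleanup pass is just a comparison against the previously written word (a sketch; file names are placeholders):

import java.io.*;

// After sorting, duplicates are adjacent: keep a word only if it differs
// from the one written immediately before it.
public class DedupSorted {
    public static void main(String[] args) throws IOException {
        try (BufferedReader in = new BufferedReader(new FileReader("sorted.txt"));
             BufferedWriter out = new BufferedWriter(new FileWriter("unique.txt"))) {
            String prev = null, word;
            while ((word = in.readLine()) != null) {
                if (!word.equals(prev)) {
                    out.write(word);
                    out.newLine();
                }
                prev = word;
            }
        }
    }
}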

ᴇʟᴇvᴀтᴇ
  • 12,285
  • 4
  • 43
  • 66
4

For large files I try not to read the data into memory but instead operate on a memory-mapped file and let the OS page memory in and out as needed. If your set structures contain offsets into this memory-mapped file instead of the actual strings, they will consume significantly less memory.

Check out this article:

http://javarevisited.blogspot.com/2012/01/memorymapped-file-and-io-in-java.html
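
A sketch of the mapping and of reading a word back from a stored offset (the ~1.9 GB file just fits under the 2 GB limit of a single MappedByteBuffer; the dedup structure itself, e.g. an open-addressing table of int offsets keyed by the word's hash, is left out):

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.Charset;

public class MappedWords {
    public static void main(String[] args) throws Exception {
        RandomAccessFile raf = new RandomAccessFile("words.txt", "r");  // placeholder name
        FileChannel ch = raf.getChannel();
        MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());

        // A dedup set would store 4-byte int offsets and call wordAt() only when
        // hashes collide, instead of keeping hundreds of millions of String objects.
        System.out.println(wordAt(map, 0));
    }

    /** Reads the newline-terminated word starting at the given offset. */
    static String wordAt(MappedByteBuffer map, int offset) {
        int end = offset;
        while (end < map.limit() && map.get(end) != '\n') end++;
        byte[] bytes = new byte[end - offset];
        for (int i = 0; i < bytes.length; i++) bytes[i] = map.get(offset + i);
        return new String(bytes, Charset.forName("UTF-8"));  // assumes ASCII/UTF-8 IDs
    }
}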

Dave Rager
  • 8,002
  • 3
  • 33
  • 52
4

Question: Are these really WORDS, or are they something else -- phrases, part numbers, etc?

For WORDS in a common spoken language, one would expect that after the first couple of thousand you'd have found most of the unique words, so all you really need to do is read a word in and check it against a dictionary: if found, skip it; if not found, add it to the dictionary and write it out.

In this case your dictionary is only a few thousand words large. And you don't need to retain the source file, since you write out the unique words as soon as you find them (or you can simply dump the dictionary when you're done).
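
In code, that is a single streaming pass (a sketch; file names are placeholders, and it only works if the number of distinct words stays small enough for the heap, as assumed above):

import java.io.*;
import java.util.HashSet;
import java.util.Set;

// Write a word out the first time it is seen, skip it from then on.
public class StreamingDedup {
    public static void main(String[] args) throws IOException {
        Set<String> dictionary = new HashSet<String>();
        try (BufferedReader in = new BufferedReader(new FileReader("words.txt"));
             BufferedWriter out = new BufferedWriter(new FileWriter("unique.txt"))) {
            String word;
            while ((word = in.readLine()) != null) {
                if (dictionary.add(word)) {   // add() returns false for duplicates
                    out.write(word);
                    out.newLine();
                }
            }
        }
    }
}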

Hot Licks
  • 47,103
  • 17
  • 93
  • 151
4

If you have the possibility to insert the words into a temporary table of a database (using batch inserts), then it comes down to a SELECT DISTINCT on that table.
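
A hedged sketch of that approach with plain JDBC (the connection URL, credentials, table name, column size, and batch size are all placeholders):

import java.io.*;
import java.sql.*;

// Load every word into a scratch table in batches, then read back SELECT DISTINCT.
public class DbDedup {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection("jdbc:yourdb://host/db", "user", "pass");
             BufferedReader in = new BufferedReader(new FileReader("words.txt"))) {
            con.createStatement().execute("CREATE TABLE words_tmp (word VARCHAR(64))");
            PreparedStatement ins = con.prepareStatement("INSERT INTO words_tmp (word) VALUES (?)");
            String word;
            int pending = 0;
            while ((word = in.readLine()) != null) {
                ins.setString(1, word);
                ins.addBatch();
                if (++pending == 10000) { ins.executeBatch(); pending = 0; }
            }
            if (pending > 0) ins.executeBatch();

            try (BufferedWriter out = new BufferedWriter(new FileWriter("unique.txt"));
                 ResultSet rs = con.createStatement().executeQuery("SELECT DISTINCT word FROM words_tmp")) {
                while (rs.next()) {
                    out.write(rs.getString(1));
                    out.newLine();
                }
            }
        }
    }
}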

3

One classic way to solve this kind of problem is a Bloom filter. Basically, you hash your word a number of times and, for each hash result, set a bit in a bit vector. If you're checking a word and all the bits from its hashes are already set in the vector, you've probably seen it before and it's a duplicate (you can set this probability arbitrarily low by increasing the number of hashes/bits in the vector).

This was how early spell checkers worked. They knew whether a word was in the dictionary, but they couldn't tell you what the correct spelling was, because the filter only tells you whether the current word has been seen.

There are a number of open source implementations out there, including java-bloomfilter.
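
A toy version of the idea, hand-rolled with a BitSet and two hash functions (the bit-vector size and hash count are illustrative; a real implementation such as java-bloomfilter derives them from the expected element count and target false-positive rate):

import java.util.BitSet;

// Toy Bloom filter: k = 2 hash functions over an m-bit BitSet.
// mightContain() can return false positives, but never false negatives.
class TinyBloomFilter {
    private final BitSet bits;
    private final int m;

    TinyBloomFilter(int mBits) {
        this.m = mBits;
        this.bits = new BitSet(mBits);
    }

    void add(String s) {
        bits.set(index(s.hashCode()));
        bits.set(index(altHash(s)));
    }

    boolean mightContain(String s) {
        return bits.get(index(s.hashCode())) && bits.get(index(altHash(s)));
    }

    private int index(int hash) {
        return (hash & 0x7fffffff) % m;
    }

    private static int altHash(String s) {
        int h = 17;                                   // arbitrary second hash
        for (int i = 0; i < s.length(); i++) h = h * 31 + s.charAt(i) * 131;
        return h;
    }
}

Because of the possible false positives, a word the filter flags as "seen" would still need a second check (for example against the already-written output) before it is safe to drop.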

Paul Rubel
  • 26,632
  • 7
  • 60
  • 80
  • How would you verify that it actually is a duplicate (and not a false positive)? – Tobias Ritzau Sep 19 '12 at 19:17
  • You can set the probability arbitrarily low, at the cost of memory. Unfortunately it's the price you pay for a probabilistic algorithm. Given your constraints, data size, and the fact that you don't need to check additional members after the fact the sorting solutions are likely more appropriate. – Paul Rubel Sep 19 '12 at 20:23
  • 2
    A Bloom filter would be unnecessarily inexact. – NovaDenizen Sep 19 '12 at 20:44
  • @TobiasRitzau If the Bloom filter indicates that the word is probably a duplicate, you could (at great expense), seek back to the beginning of the output file, and scan it for the word to verify it really is there. – Samuel Edwin Ward Sep 20 '12 at 00:30
  • I haven't used Bloom filters for real, but I was thinking of a solution where you make one pass to build the filter for the set, and take a second pass where you store the misses in the output file. The hits are kept in a set and only written if they do not already exist in the set. Since there were only about 1% duplicates, this should work. Or? – Tobias Ritzau Sep 20 '12 at 04:35
1

I'd tackle this in Java the same way as in every other language: Write a deduplication filter and pipe it as often as necessary.

This is what I mean (in pseudo code):

  • Input parameters: Offset, Size
  • Allocate a searchable structure of size Size (= a Set, but it need not be one)
  • Read Offset elements from stdin (or until EOF is encountered) and just copy them to stdout
  • Read Size elements from stdin (or until EOF), storing them in the Set. If a duplicate, drop it; else write it to stdout.
  • Read elements from stdin until EOF; if they are in the Set, drop them, else write them to stdout

Now pipe as many instances as you need (if storage is no problem, maybe only as many as you have cores) with increasing Offsets and a sane Size. This lets you use more cores, as I suspect the process is CPU bound. You can even use netcat and spread processing over more machines if you are in a hurry.
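
A sketch of one such filter stage in Java (the class name, the one-word-per-line input, and the argument handling are illustrative):

import java.io.*;
import java.util.HashSet;
import java.util.Set;

// One pipeline stage: copy the first `offset` words through untouched (a previous
// stage already owns them), remember the next `size` words, and from then on drop
// anything that is in this stage's window.
public class DedupStage {
    public static void main(String[] args) throws IOException {
        long offset = Long.parseLong(args[0]);
        int size = Integer.parseInt(args[1]);
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        BufferedWriter out = new BufferedWriter(new OutputStreamWriter(System.out));
        Set<String> window = new HashSet<String>(size * 2);
        String word;
        long passed = 0;
        while (passed < offset && (word = in.readLine()) != null) {
            out.write(word); out.newLine();
            passed++;
        }
        while (window.size() < size && (word = in.readLine()) != null) {
            if (window.add(word)) { out.write(word); out.newLine(); }
        }
        while ((word = in.readLine()) != null) {
            if (!window.contains(word)) { out.write(word); out.newLine(); }
        }
        out.flush();
    }
}

Chained as e.g. java DedupStage 0 5000000 < words.txt | java DedupStage 5000000 5000000 | ... each stage only filters against its own window, so the pipeline needs enough stages to cover the whole file.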

Eugen Rieck
  • 64,175
  • 10
  • 70
  • 92
1

To not have to worry too much about implementation, you should use a database system, either plain old relational SQL or a NoSQL solution. I'm pretty sure you could use e.g. Berkeley DB Java Edition and then do (pseudo code):

for(word : stream) {
  if(!DB.exists(word)) {
     DB.put(word)
     outstream.add(word)
  }
}
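
A concrete version of that loop with Berkeley DB Java Edition might look roughly like this (the environment path and file names are placeholders; putNoOverwrite reports whether the key already existed, so the database doubles as the on-disk "seen" set):

import com.sleepycat.je.*;
import java.io.*;

public class BdbDedup {
    public static void main(String[] args) throws Exception {
        EnvironmentConfig envCfg = new EnvironmentConfig();
        envCfg.setAllowCreate(true);
        Environment env = new Environment(new File("/tmp/dedup-env"), envCfg);  // placeholder path
        DatabaseConfig dbCfg = new DatabaseConfig();
        dbCfg.setAllowCreate(true);
        Database db = env.openDatabase(null, "seenWords", dbCfg);

        try (BufferedReader in = new BufferedReader(new FileReader("words.txt"));
             BufferedWriter out = new BufferedWriter(new FileWriter("unique.txt"))) {
            DatabaseEntry data = new DatabaseEntry(new byte[0]);
            String word;
            while ((word = in.readLine()) != null) {
                DatabaseEntry key = new DatabaseEntry(word.getBytes("UTF-8"));
                // KEYEXIST for duplicates, SUCCESS for words seen for the first time.
                if (db.putNoOverwrite(null, key, data) == OperationStatus.SUCCESS) {
                    out.write(word);
                    out.newLine();
                }
            }
        }
        db.close();
        env.close();
    }
}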

The problem is in essence easy: you need to store things on disk because there is not enough memory, and then use either sorting, O(N log N) (unnecessary), or hashing, O(N), to find the unique words.

If you want a solution that will very likely work but is not guaranteed to, use an LRU-type hash table. According to the empirical Zipf's law, you should be OK.

A follow-up question to some smart guy out there: what if I have a 64-bit machine and set the heap size to, say, 12 GB? Shouldn't virtual memory take care of the problem (although not in an optimal way), or is Java not designed this way?

Roddy of the Frozen Peas
  • 14,380
  • 9
  • 49
  • 99
user1443778
  • 581
  • 1
  • 5
  • 20
1

Even in English, which has a huge number of words for a natural language, the upper estimates are only about 80,000 words. Based on that, you could just use a HashSet and add all your words to it (probably all lower-cased to avoid case issues):

Set<String> words = new HashSet<String>();
try (BufferedReader in = new BufferedReader(new FileReader("words.txt"))) {  // file name is a placeholder
    String word;
    while ((word = in.readLine()) != null) {
        words.add(word.toLowerCase());
    }
}

If they are real words, this isn't going to cause memory problems, and it will be pretty fast too!

Bohemian
  • 412,405
  • 93
  • 575
  • 722
  • That's the first I thought too, but in the topic he says they already tried with Sets and failed. They must not be real words – enTropy Sep 21 '12 at 20:23
0

Quicksort would be a good option over Mergesort in this case because it needs less memory. This thread has a good explanation as to why.

Community
  • 1
  • 1
Evo510
  • 173
  • 1
  • 1
  • 15
  • 6
    But quicksort is an in-memory sort, and mergesort only really requires enough ram to hold 2 read buffers and a write buffer. – NovaDenizen Sep 19 '12 at 20:43
0

The most performant solutions arise from omitting unnecessary stuff. You are only looking for duplicates, so don't store the words themselves, store hashes. But wait, you aren't interested in the hashes either, only in whether they have already been seen - so don't store them. Treat the hash as a really large number, and use a bitset to check whether you have already seen that number.

So your problem boils down to a really big, sparsely populated bitmap, with its size depending on the hash width. If your hash is up to 32 bits, you can use a riak bitmap.

... gone thinking about a really big bitmap for 128+ bit hashes %) (I'll be back)
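
A sketch of that bitmap for 32-bit hashes (512 MB of longs, so it only helps if the heap can hold it; note that hash collisions will occasionally discard a word that was not actually a duplicate, much like a one-hash Bloom filter):

// One bit per possible 32-bit hash value: 2^32 bits = 2^26 longs = 512 MB.
public class HashBitmapDedup {
    private final long[] seen = new long[1 << 26];

    /** Returns true the first time this hash value is observed. */
    boolean markSeen(String word) {
        long h = word.hashCode() & 0xFFFFFFFFL;   // treat the hash as unsigned 32-bit
        int slot = (int) (h >>> 6);
        long mask = 1L << (h & 63);
        boolean isNew = (seen[slot] & mask) == 0;
        seen[slot] |= mask;
        return isNew;
    }
}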

Konstantin Pribluda
  • 12,329
  • 1
  • 30
  • 35