4

I'm very new to Java so forgive me if I'm doing something terribly wrong.

I'm working on a project where I need to quickly scan a very large volume of data (CSV with 50 million lines or more, 5 entries per line) for repeats. I've resorted to using a HashMap, since its .contains() method is fast.

However, I end up having to store a million keys or more in the map, and each key is associated with an int[] array that can have anywhere from 1 to 100 entries. So, obviously, I end up getting an OutOfMemoryError unless I'm using a laptop with ~16 GB of RAM.

I was thinking that once the HashMap gets more than N keys or a key gets more than N entries, I could write it somewhere and clear it. However, not all keys or values are found at once, so I need to be able to add to the written hashmap, and not overwrite it.

I've searched far and wide and still can't find a way to do it, so thanks a lot to whoever can help!

dimo414
larrienea
  • If those lines have some identifying attribute, then you could store just that, instead of the whole line. Or maybe create a hashcode from the line, and only if you find potential collisions, store those hashes in another list and do a second iteration to see whether the entries with the same hashcode are truly identical. – tobias_k Jul 30 '14 at 12:57
  • Try specifying -Xmx3G (or -Xmx4G, -Xmx5G, ...) in the 'java' command. You need to increase your heap size; a HashMap can hold very large amounts of data. – DmitryKanunnikoff Jul 30 '14 at 12:57
  • @tobias_k Thanks, I'll try iterating twice! – larrienea Jul 30 '14 at 13:09
  • @DmitryTsechoev and thanks too xD I'm running it with -Xmx14g so I figured it's getting kind of ridiculous – larrienea Jul 30 '14 at 13:09
  • Note that while `HashMap` does have an efficient `.contains()` method, the proper class to be using for duplicate detection is a [`Set`](http://docs.oracle.com/javase/7/docs/api/java/util/Set.html), such as [`HashSet`](http://docs.oracle.com/javase/7/docs/api/java/util/HashSet.html). – dimo414 Jul 30 '14 at 13:26

3 Answers

4

You have quite a lot of options here, I'll list some of them:

  1. More Memory: It sounds like you've already tried giving Java more memory, but if not, the -Xmx JVM flag - e.g. -Xmx3G, as Dmitry suggests - will give you three gigabytes of heap, vs. the default, which is <= 1 GB.
  2. Store Less Data: You're currently storing whole rows of "1 to 100 entries", when really all we need to know is whether the data is unique or not. The Arrays.hashCode() function gives you a reasonably accurate fingerprint of a row in a single int, so we can put it to use to limit the amount of data you need to hold in memory:

    1. Construct two HashSet<Integer> objects, called seen and seenTwice. Loop over your data, and add each array's hash to seen, and to seenTwice if it was already in seen, like so:

      int[] arr = ... // construct the row's array
      int hash = Arrays.hashCode(arr);
      if(!seen.add(hash)) {
        // add returns false if we've already seen this hash
        seenTwice.add(hash);
      }
      
    2. Now we have a set of hashes that we saw two or more times; in theory, this will be a much smaller set than the number of rows in our file. We can let seen get garbage collected and re-read the file, using seenTwice to decide which rows to keep around, like you were first trying to do. One caveat: int[] doesn't override equals()/hashCode(), so a HashSet<int[]> won't spot value-equal arrays; store each candidate row as a List<Integer> in a Set<List<Integer>> rows instead (a complete two-pass sketch also follows this list):

      int[] arr = ... // construct the row's array
      int hash = Arrays.hashCode(arr);
      if(seenTwice.contains(hash)) {
        // If the hash isn't in seenTwice, the row can't be a duplicate,
        // so only these candidate rows need to be stored and compared exactly
        List<Integer> row = new ArrayList<>();
        for(int value : arr) {
          row.add(value);
        }
        if(!rows.add(row)) {
          System.out.println("Row " + Arrays.toString(arr) + " is a duplicate!");
        }
      }
      
  3. Use Bash: If you're willing to forgo Java, you can find duplicates very easily with a basic bash command (note that uniq only reports adjacent duplicates, which is why the data is sorted first):

    cat filename | sort | uniq -d
    
  4. Use a Database: You can, as you were alluding to, use some out-of-core solution, notably a database. A good, easy-to-use Java database is H2, but covering its use is outside the scope of this answer. Suffice it to say, you could load your data from the file into a database and then simply query for duplicate rows: Finding duplicate values in a SQL table

    But setting up a DB simply to find duplicates in 50 million lines is overkill. I wouldn't recommend this option.
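
Putting option 2 together end to end, here's a minimal sketch. It assumes the file is named data.csv and that each line holds 5 comma-separated integer fields - both assumptions on my part, so adjust parseRow() to your actual format:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    public class DuplicateFinder {

      public static void main(String[] args) throws IOException {
        Path csv = Paths.get("data.csv"); // assumed file name

        // Pass 1: record which row hashes occur more than once
        Set<Integer> seen = new HashSet<>();
        Set<Integer> seenTwice = new HashSet<>();
        try (BufferedReader reader = Files.newBufferedReader(csv, StandardCharsets.UTF_8)) {
          String line;
          while ((line = reader.readLine()) != null) {
            int hash = Arrays.hashCode(parseRow(line));
            if (!seen.add(hash)) {
              seenTwice.add(hash);
            }
          }
        }
        seen = null; // allow the big first-pass set to be garbage collected

        // Pass 2: only rows whose hash repeated are candidates; compare those exactly
        Set<List<Integer>> rows = new HashSet<>();
        try (BufferedReader reader = Files.newBufferedReader(csv, StandardCharsets.UTF_8)) {
          String line;
          while ((line = reader.readLine()) != null) {
            int[] arr = parseRow(line);
            if (seenTwice.contains(Arrays.hashCode(arr))) {
              List<Integer> row = new ArrayList<>();
              for (int value : arr) {
                row.add(value);
              }
              if (!rows.add(row)) {
                System.out.println("Row " + Arrays.toString(arr) + " is a duplicate!");
              }
            }
          }
        }
      }

      // Assumes each line holds 5 comma-separated integer fields
      private static int[] parseRow(String line) {
        String[] parts = line.split(",");
        int[] arr = new int[parts.length];
        for (int i = 0; i < parts.length; i++) {
          arr[i] = Integer.parseInt(parts[i].trim());
        }
        return arr;
      }
    }

The first pass never holds more than two sets of boxed hashes; the second only stores rows whose hash actually repeated, which should keep memory use far below storing every row.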


See also: Script to find duplicates in a csv file

dimo414
  • +1 (2) is a very nice write-up of the idea I just sketched in a comment. Might want to add a note about (according to your link) `uniq` only finding duplicates that are next to each other, though. – tobias_k Jul 30 '14 at 14:17
  • Hm... but is it really a good idea to `sort` that file with 50 million lines? – tobias_k Jul 30 '14 at 14:19
  • Bash's `sort` can sort extremely large amounts of data with ease. See [this question](http://stackoverflow.com/q/930044/113632) for more. – dimo414 Jul 30 '14 at 14:25
1

I'm not sure exactly what you want to do, but would a SQL database help? Then you could store your values externally and wouldn't need such a large amount of RAM.

If that isn't applicable in your case, that's unfortunate; when I read your question, I was sure that using a DB would solve your problems.
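
As a rough illustration of this route (the answer doesn't name a specific database, so this sketch assumes the embedded H2 database mentioned in the other answer, a file data.csv whose first line is a header, and five int columns c1..c5 - all assumptions on my part; the H2 jar must be on the classpath), the duplicate check reduces to a single GROUP BY query:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class SqlDuplicateCheck {

      public static void main(String[] args) throws SQLException {
        // Embedded H2 database stored in the working directory (hypothetical name)
        try (Connection conn = DriverManager.getConnection("jdbc:h2:./dupes", "sa", "");
             Statement st = conn.createStatement()) {

          // H2's CSVREAD reads a CSV file directly; by default it treats the
          // first line as column names (pass an explicit column list otherwise)
          st.execute("CREATE TABLE csv_rows(c1 INT, c2 INT, c3 INT, c4 INT, c5 INT) "
              + "AS SELECT * FROM CSVREAD('data.csv')");

          // Any combination of the five columns occurring more than once is a duplicate
          try (ResultSet rs = st.executeQuery(
              "SELECT c1, c2, c3, c4, c5, COUNT(*) AS cnt FROM csv_rows "
              + "GROUP BY c1, c2, c3, c4, c5 HAVING COUNT(*) > 1")) {
            while (rs.next()) {
              System.out.println("Duplicate row ("
                  + rs.getInt("c1") + ", " + rs.getInt("c2") + ", " + rs.getInt("c3") + ", "
                  + rs.getInt("c4") + ", " + rs.getInt("c5") + ") appears "
                  + rs.getInt("cnt") + " times");
            }
          }
        }
      }
    }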

PKlumpp
-1

Use phpMyAdmin or HeidiSQL to upload the data from the CSV file into a database.

You can raise phpMyAdmin's upload limit in its .ini file, so getting the data inserted into the database is simple.

Then fetch the data from the database as Java POJOs and process it. That saves your memory.
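
To sketch the fetch-and-process step (nothing here is spelled out in the answer, so the connection URL, credentials, table name csv_rows, and columns c1..c5 are placeholder assumptions, and the JDBC driver for your database must be on the classpath), reading rows back as simple POJOs one at a time might look roughly like this:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.util.Arrays;

    public class RowFetcher {

      // Simple POJO holding one CSV row (5 int fields assumed)
      static class Row {
        final int[] values;

        Row(int[] values) {
          this.values = values;
        }

        @Override
        public String toString() {
          return Arrays.toString(values);
        }
      }

      public static void main(String[] args) throws SQLException {
        String url = "jdbc:mysql://localhost:3306/csvdata"; // hypothetical database
        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             PreparedStatement ps = conn.prepareStatement(
                 "SELECT c1, c2, c3, c4, c5 FROM csv_rows")) {
          ps.setFetchSize(1000); // hint to the driver not to buffer everything at once
          try (ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
              Row row = new Row(new int[] {
                  rs.getInt("c1"), rs.getInt("c2"), rs.getInt("c3"),
                  rs.getInt("c4"), rs.getInt("c5")});
              process(row); // handle one row at a time instead of holding them all
            }
          }
        }
      }

      private static void process(Row row) {
        // placeholder for whatever per-row processing is needed
        System.out.println(row);
      }
    }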

theRoot
  • This is only marginally better than just posting "You could use databases to solve your problem." Please elaborate. – dimo414 Jul 30 '14 at 13:27