3

I have a big file with rows like ID|VALUE that I need to process in one pass.

If an ID repeats, the line must be ignored.

How can I do this check efficiently?
Added: the ID is a long (8 bytes). I need a solution that uses a minimum of memory.
Thanks for the help, guys. I was able to increase the heap space and use a Set now.

Sarge

6 Answers

4

You can store the data in a TLongObjectHashMap or use a TLongHashSet. These classes (from the GNU Trove library) store primitive-based information efficiently.

Five million long values will use less than 60 MB in a TLongHashSet; a TLongObjectHashMap will also store your values efficiently.

To find out more about these classes:

http://www.google.co.uk/search?q=TLongHashSet

http://www.google.co.uk/search?q=TLongObjectHashMap
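
For illustration, here's a minimal sketch of this approach. It assumes Trove 3 is on the classpath (in Trove 2 the class lives directly in the gnu.trove package) and that each line has the ID|VALUE layout from the question; the file-name argument and the process() method are placeholders.

```java
import gnu.trove.set.hash.TLongHashSet;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class DedupWithTrove {
    public static void main(String[] args) throws IOException {
        TLongHashSet seen = new TLongHashSet();
        try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
            String line;
            while ((line = in.readLine()) != null) {
                long id = Long.parseLong(line.substring(0, line.indexOf('|')));
                // add() returns false when the id is already in the set,
                // so duplicate lines are simply skipped
                if (seen.add(id)) {
                    process(line);
                }
            }
        }
    }

    private static void process(String line) {
        System.out.println(line); // placeholder for whatever is done with a unique line
    }
}
```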

Peter Lawrey
2

You'll have to store the IDs somewhere anyway in order to detect duplicates. Here I'd use a HashSet<String> and its contains method.
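
A minimal sketch of that idea, assuming the ID|VALUE layout from the question (the file-name argument and the println are placeholders):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

public class DedupWithHashSet {
    public static void main(String[] args) throws IOException {
        Set<String> seenIds = new HashSet<String>();
        try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
            String line;
            while ((line = in.readLine()) != null) {
                String id = line.substring(0, line.indexOf('|'));
                if (seenIds.contains(id)) {
                    continue; // duplicate ID, ignore the line
                }
                seenIds.add(id);
                System.out.println(line); // placeholder for real processing
            }
        }
    }
}
```

(`Set.add` returns `false` for duplicates, so the `contains` call could be folded into it; the sketch just mirrors the description above.)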

Andreas Dolk
2

You have to read the entire file, one line at a time. Keep a Set of IDs and compare each incoming ID to the values already in the Set. If the ID is already present, skip that line.

You wrote the use case yourself; there's no magic here.

duffymo
2

This looks like a typical database task to me. If your app already uses a database, you could use it for this: create a table with a UNIQUE integer field and start adding rows; you'll get an exception on duplicated IDs. The database engine will take care of cursor windowing and caching, so it fits in your memory budget. Then just drop that table when you're done.
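
A rough sketch of that idea with plain JDBC. The in-memory H2 URL, the table name, and the BIGINT column are placeholders (the question says the ID is an 8-byte long); note that some databases abort the whole transaction on a constraint violation, in which case you'd need a savepoint or an INSERT IGNORE / ON CONFLICT variant instead of the per-row catch.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Statement;

public class DedupWithDatabase {
    public static void main(String[] args) throws Exception {
        // The JDBC URL is a placeholder; in practice you would reuse the
        // connection your application already has.
        try (Connection con = DriverManager.getConnection("jdbc:h2:mem:dedup")) {
            try (Statement st = con.createStatement()) {
                st.execute("CREATE TABLE seen_ids (id BIGINT UNIQUE)");
            }
            con.setAutoCommit(false); // one transaction for all inserts
            try (PreparedStatement insert =
                         con.prepareStatement("INSERT INTO seen_ids (id) VALUES (?)");
                 BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
                String line;
                while ((line = in.readLine()) != null) {
                    long id = Long.parseLong(line.substring(0, line.indexOf('|')));
                    insert.setLong(1, id);
                    try {
                        insert.executeUpdate();
                        process(line);            // first time we see this ID
                    } catch (SQLException duplicate) {
                        // constraint violation: duplicate ID, ignore the line
                    }
                }
                con.commit();
            }
        }
    }

    private static void process(String line) {
        System.out.println(line); // placeholder for the real handling
    }
}
```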

JBM
  • I put the values into a table. I use executeBatch for inserting and can't just catch the exception and ignore it. – Sarge Jul 06 '11 at 13:38
  • 1
    Hm, if you add values to a table anyway, why not let the database do the job? Instead of doing a batch insert I would start a transaction, nest a loop with a `try` inside it which simply ignores exceptions on duplicate IDs, and commit the transaction at the end. Inserting inside a transaction is very fast. It may be slower than a batch insert, but it's certainly faster than pre-filtering duplicates (and much more memory-efficient, too). – JBM Jul 07 '11 at 20:00
2

There are two basic solutions:

First, as suggested by duffymo and Andreas_D above, you can store all the values in a Set. This gives you O(n) time complexity and O(n) memory usage.

Second, if O(n) memory is too much, you can trade speed for memory and do it in O(1) memory: for each line in the file, re-read all the lines before it and discard the current line if its ID has already appeared.
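
A sketch of that second option, assuming the ID|VALUE layout from the question; it is O(n^2) in time because the file is re-read from the start for every line.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class DedupWithRescan {
    public static void main(String[] args) throws IOException {
        String file = args[0];
        try (BufferedReader in = new BufferedReader(new FileReader(file))) {
            String line;
            long lineNo = 0;
            while ((line = in.readLine()) != null) {
                if (!seenBefore(file, extractId(line), lineNo)) {
                    System.out.println(line); // placeholder for real processing
                }
                lineNo++;
            }
        }
    }

    // Returns true if the id occurs on any of the first `limit` lines of the file.
    private static boolean seenBefore(String file, long id, long limit) throws IOException {
        try (BufferedReader in = new BufferedReader(new FileReader(file))) {
            String line;
            for (long i = 0; i < limit && (line = in.readLine()) != null; i++) {
                if (extractId(line) == id) {
                    return true;
                }
            }
        }
        return false;
    }

    private static long extractId(String line) {
        return Long.parseLong(line.substring(0, line.indexOf('|')));
    }
}
```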

Qwerky
1

What about probabilistic algorithms?

The Bloom filter ... is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. False positives are possible, but false negatives are not.
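
As a sketch of how that could look with Guava's BloomFilter (the 5 million expected insertions and the 0.1% false-positive rate are assumptions; because of false positives, a few genuinely unique lines may be dropped unless suspected duplicates are verified some other way):

```java
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class DedupWithBloomFilter {
    public static void main(String[] args) throws IOException {
        // expected number of IDs and acceptable false-positive rate are guesses
        BloomFilter<Long> seen =
                BloomFilter.create(Funnels.longFunnel(), 5_000_000, 0.001);
        try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
            String line;
            while ((line = in.readLine()) != null) {
                long id = Long.parseLong(line.substring(0, line.indexOf('|')));
                if (seen.mightContain(id)) {
                    continue; // probably a duplicate (or, rarely, a false positive)
                }
                seen.put(id);
                System.out.println(line); // placeholder for real processing
            }
        }
    }
}
```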

Mikhail