3

I have a big file with rows like ID|VALUE that I need to process in one pass.

If an ID repeats, the line must be ignored.

How can I do this check efficiently?
Added: the ID is a long (8 bytes). I need a solution that uses a minimum of memory.
Thanks for the help, guys. I was able to increase the heap space and use a Set now.

Sarge

6 Answers

4

You can store the data in a TLongObjectHashMap or use a TLongHashSet. These classes (from the GNU Trove library) store primitive-based information efficiently.

Five million long values will use less than 60 MB in a TLongHashSet; a TLongObjectHashMap will also store your values efficiently.

To find out more about these classes:

http://www.google.co.uk/search?q=TLongHashSet

http://www.google.co.uk/search?q=TLongObjectHashMap
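
For illustration, here's a minimal sketch of this approach. It assumes Trove 3 is on the classpath (in Trove 2 the class lives directly in the gnu.trove package) and that each line has the ID|VALUE layout from the question; the file-name argument and the process() method are placeholders.

```java
import gnu.trove.set.hash.TLongHashSet;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class DedupWithTrove {
    public static void main(String[] args) throws IOException {
        TLongHashSet seen = new TLongHashSet();
        try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
            String line;
            while ((line = in.readLine()) != null) {
                long id = Long.parseLong(line.substring(0, line.indexOf('|')));
                // add() returns false when the id is already in the set,
                // so duplicate lines are simply skipped
                if (seen.add(id)) {
                    process(line);
                }
            }
        }
    }

    private static void process(String line) {
        System.out.println(line); // placeholder for whatever is done with a unique line
    }
}
```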

Peter Lawrey
2

You'll have to store the IDs somewhere anyway in order to detect duplicates. Here I'd use a HashSet<String> and its contains method.
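
A minimal sketch of that idea, assuming the ID|VALUE layout from the question (the file-name argument and the println are placeholders):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

public class DedupWithHashSet {
    public static void main(String[] args) throws IOException {
        Set<String> seenIds = new HashSet<String>();
        try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
            String line;
            while ((line = in.readLine()) != null) {
                String id = line.substring(0, line.indexOf('|'));
                if (seenIds.contains(id)) {
                    continue; // duplicate ID, ignore the line
                }
                seenIds.add(id);
                System.out.println(line); // placeholder for real processing
            }
        }
    }
}
```

(`Set.add` returns `false` for duplicates, so the `contains` call could be folded into it; the sketch just mirrors the description above.)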

Andreas Dolk
2

You have to read the entire file, one line at a time. Keep a Set of IDs and compare each incoming ID to the values already in the Set. If the ID is already present, skip that line.

You wrote the use case yourself; there's no magic here.

duffymo
2

This looks like a typical database task to me. If your app already uses a database, you could use it for this: create a table with a UNIQUE integer field and start adding rows; you'll get an exception on duplicated IDs. The database engine will take care of cursor windowing and caching, so it fits in your memory budget. Then just drop that table when you're done.
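
A rough sketch of that idea with plain JDBC. The in-memory H2 URL, the table name, and the BIGINT column are placeholders (the question says the ID is an 8-byte long); note that some databases abort the whole transaction on a constraint violation, in which case you'd need a savepoint or an INSERT IGNORE / ON CONFLICT variant instead of the per-row catch.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Statement;

public class DedupWithDatabase {
    public static void main(String[] args) throws Exception {
        // The JDBC URL is a placeholder; in practice you would reuse the
        // connection your application already has.
        try (Connection con = DriverManager.getConnection("jdbc:h2:mem:dedup")) {
            try (Statement st = con.createStatement()) {
                st.execute("CREATE TABLE seen_ids (id BIGINT UNIQUE)");
            }
            con.setAutoCommit(false); // one transaction for all inserts
            try (PreparedStatement insert =
                         con.prepareStatement("INSERT INTO seen_ids (id) VALUES (?)");
                 BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
                String line;
                while ((line = in.readLine()) != null) {
                    long id = Long.parseLong(line.substring(0, line.indexOf('|')));
                    insert.setLong(1, id);
                    try {
                        insert.executeUpdate();
                        process(line);            // first time we see this ID
                    } catch (SQLException duplicate) {
                        // constraint violation: duplicate ID, ignore the line
                    }
                }
                con.commit();
            }
        }
    }

    private static void process(String line) {
        System.out.println(line); // placeholder for the real handling
    }
}
```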

JBM
  • I put the values into a table. I use executeBatch for inserting and can't just catch the exception and ignore it. – Sarge Jul 06 '11 at 13:38
  • 1
    Hm, if you add values to a table anyway, why not let the database do the job? Instead of doing a batch insert I would start a transaction, nest a loop with a `try` inside it which simply ignores exceptions on duplicate IDs, and commit the transaction at the end. Inserting inside a transaction is very fast. It may be slower than a batch insert, but it's certainly faster than pre-filtering duplicates (and much more memory-efficient, too). – JBM Jul 07 '11 at 20:00
2

There are two basic solutions:

First, as suggested by duffymo and Andreas_D above, you can store all the values in a Set. This gives you O(n) time complexity and O(n) memory usage.

Second, if O(n) memory is too much, you can trade speed for memory and do it in O(1) memory: for each line in the file, re-read all the lines before it and discard the current line if its ID has already appeared.
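
A sketch of that second option, assuming the ID|VALUE layout from the question; it is O(n^2) in time because the file is re-read from the start for every line.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class DedupWithRescan {
    public static void main(String[] args) throws IOException {
        String file = args[0];
        try (BufferedReader in = new BufferedReader(new FileReader(file))) {
            String line;
            long lineNo = 0;
            while ((line = in.readLine()) != null) {
                if (!seenBefore(file, extractId(line), lineNo)) {
                    System.out.println(line); // placeholder for real processing
                }
                lineNo++;
            }
        }
    }

    // Returns true if the id occurs on any of the first `limit` lines of the file.
    private static boolean seenBefore(String file, long id, long limit) throws IOException {
        try (BufferedReader in = new BufferedReader(new FileReader(file))) {
            String line;
            for (long i = 0; i < limit && (line = in.readLine()) != null; i++) {
                if (extractId(line) == id) {
                    return true;
                }
            }
        }
        return false;
    }

    private static long extractId(String line) {
        return Long.parseLong(line.substring(0, line.indexOf('|')));
    }
}
```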

Qwerky
1

What about probabilistic algorithms?

The Bloom filter ... is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. False positives are possible, but false negatives are not.
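
As a sketch of how that could look with Guava's BloomFilter (the 5 million expected insertions and the 0.1% false-positive rate are assumptions; because of false positives, a few genuinely unique lines may be dropped unless suspected duplicates are verified some other way):

```java
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class DedupWithBloomFilter {
    public static void main(String[] args) throws IOException {
        // expected number of IDs and acceptable false-positive rate are guesses
        BloomFilter<Long> seen =
                BloomFilter.create(Funnels.longFunnel(), 5_000_000, 0.001);
        try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
            String line;
            while ((line = in.readLine()) != null) {
                long id = Long.parseLong(line.substring(0, line.indexOf('|')));
                if (seen.mightContain(id)) {
                    continue; // probably a duplicate (or, rarely, a false positive)
                }
                seen.put(id);
                System.out.println(line); // placeholder for real processing
            }
        }
    }
}
```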

Mikhail