
I am new to Java and do not know the differences between the Java collection implementations.

I have to process up to 100K records of imported data. There might be duplicates on that list. I have to put all of it into the DB. Before the import I clean the database table, so there are no duplicates in the DB at the beginning.

I am batch inserting the data with Hibernate. I want to do something like this:

SomeCollectionClass<Integer> alreadyInsertedRecords;
//...
if (!alreadyInsertedRecords.contains(currentRecord.hashCode())) {
    save_to_database(currentRecord);
    alreadyInsertedRecords.add(currentRecord.hashCode());
} else {
    logger.log("Record no 1234 is a duplicate, skipping");
}

Which collection class should I use to check whether the record has already been inserted into the DB?

As I said, there may be more than 100,000 records, so the collection should be fast to search, fast to insert, and have a small memory footprint.

Andrew Thompson
SWilk

4 Answers


You could try a HashSet. Remember that the contained objects' class has to properly implement the hashCode() and equals() methods.
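A minimal sketch of that requirement, using a hypothetical ImportedRecord class (not from the question) whose equals()/hashCode() pair lets a HashSet recognise duplicates:

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

// Hypothetical record class: equals() and hashCode() must agree,
// i.e. equal records must produce the same hash code.
class ImportedRecord {
    final String firstName;
    final String lastName;

    ImportedRecord(String firstName, String lastName) {
        this.firstName = firstName;
        this.lastName = lastName;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof ImportedRecord)) return false;
        ImportedRecord other = (ImportedRecord) o;
        return firstName.equals(other.firstName) && lastName.equals(other.lastName);
    }

    @Override
    public int hashCode() {
        return Objects.hash(firstName, lastName);
    }
}

public class DedupExample {
    // Counts distinct records; HashSet silently drops duplicates.
    static int countUnique(ImportedRecord[] records) {
        Set<ImportedRecord> seen = new HashSet<>();
        for (ImportedRecord r : records) {
            seen.add(r);
        }
        return seen.size();
    }

    public static void main(String[] args) {
        ImportedRecord[] recs = {
            new ImportedRecord("Foo", "Bar"),
            new ImportedRecord("Foo", "Bar"),
            new ImportedRecord("Baz", "Qux")
        };
        System.out.println(countUnique(recs)); // prints 2
    }
}
```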

Alberto Segura

If the entries are sortable, you can use the TreeSet collection, which will automatically prune all duplicate entries, provided they have valid compareTo() and equals() methods implemented.

This collection also provides guaranteed log(n) time cost for the basic operations (add, remove and contains). [reference]

If you have access to the hashCode() function, then you can use HashSet. It will work similarly to TreeSet (pruning dupes on insert) and it will be faster.

Consult the Hashset vs Treeset question for details on both of these collections.

If possible, use HashSet.
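A short sketch of the pruning behaviour described above, using Integer values (which already implement Comparable):

```java
import java.util.TreeSet;

public class TreeSetDedup {
    public static void main(String[] args) {
        TreeSet<Integer> ids = new TreeSet<>();
        int[] incoming = {5, 3, 5, 1, 3};
        for (int id : incoming) {
            ids.add(id); // duplicates are silently dropped
        }
        // TreeSet keeps elements sorted and unique
        System.out.println(ids); // prints [1, 3, 5]
    }
}
```

Unlike HashSet, TreeSet also gives you the elements back in sorted order, at the cost of log(n) rather than (expected) constant-time operations.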

Dariusz
    I think HashSet is faster because it provides [constant time](http://en.wikipedia.org/wiki/Time_complexity#Constant_time) access. – Alberto Segura May 24 '13 at 10:41
  • I want to store only Integer identifiers, not whole objects, which actually are generated with `.hashCode()` method. It seems that TreeSet is the way to go. – SWilk May 24 '13 at 10:48
  • Ok, I've just reloaded the page. Seems that HashSet is better, and Integer has hashCode() method, so HashSet it is. – SWilk May 24 '13 at 10:51

If you don't want duplicates, you can use

Set<Integer> alreadyInsertedRecords = new HashSet<Integer>();
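As an aside, Set.add() already returns false when the element is present, so the contains()/add() pair from the question can be collapsed into a single call — a minimal sketch with hypothetical ids standing in for currentRecord.hashCode():

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class BatchImport {
    // Returns the ids that would actually be saved; duplicates are skipped.
    static List<Integer> dedup(int[] incoming) {
        Set<Integer> alreadyInsertedRecords = new HashSet<>();
        List<Integer> inserted = new ArrayList<>();
        for (int id : incoming) {
            // add() returns false if the id was already in the set
            if (alreadyInsertedRecords.add(id)) {
                inserted.add(id); // here you would call save_to_database(...)
            } else {
                System.out.println("Record " + id + " is a duplicate, skipping");
            }
        }
        return inserted;
    }

    public static void main(String[] args) {
        System.out.println(dedup(new int[]{101, 202, 101, 303})); // prints [101, 202, 303]
    }
}
```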
Petros Tsialiamanis

I would not use a collection for this as it can be done at the database level. You can use an insert where not exists statement.

E.g.

insert into people (firstName, lastName) 
select 'Foo', 'Bar'
where not exists (
    select 1 from people where firstName = 'Foo' and lastName = 'Bar'
)
lance-java
  • Don't you think that checking if a given object is a duplicate in the application's memory is more effective in terms of performance than invoking a database query? – Adam Siemion May 24 '13 at 11:00
  • As long as the column is indexed, it's the same. It's just a java index lookup vs a database index lookup. Both should be constant time O(1) – lance-java May 24 '13 at 11:03
  • Actually, thinking about this again you're right. There's extra network traffic involved. But the bonus is you don't need to keep 100,000 records in memory in the JVM – lance-java May 24 '13 at 11:05
  • Yes, that is what I had in mind; actually it is not only network traffic, but also setting query parameters, query parsing on the database and client side, adding the query to the redo log on the database side, and many other, less costly operations; my rough guess is that checking it on the application side is at least 10x less costly than on the database side – Adam Siemion May 24 '13 at 11:20
  • You should not really think about it like that. The worst case scenario is that for n records, you will do n inserts to the database [complexity = O(n)]. Even if you manage to avoid half of the inserts (n/2) by checking a collection in the JVM, it is still a complexity of O(n) since O(n/2) = O(n). http://en.wikipedia.org/wiki/Time_complexity. Therefore, I would choose the solution which avoids storing the collection in memory since both solutions have a complexity of O(n). – lance-java May 24 '13 at 12:10
  • ok... let's assume that to process one record the algorithm needs 1 minute, so what you are saying is that it does not matter whether it takes 100 minutes, when n is 100, or 50 minutes? I think any further discussion regarding that is pointless. **Complexity is not all that matters**. – Adam Siemion May 24 '13 at 12:25
  • Yes, under your circumstances the in-memory solution is 50 minutes faster and that's nothing to be sneezed at if that is the case. I think that optimisations like this should only be done later, when you are seeing that the process is slow. Until that time, I would choose the more scalable approach which avoids a big chunk of memory. – lance-java May 24 '13 at 13:20