
I have encountered a problem to which I cannot find a solution. I am using a HashSet to store values. The values I store are of the custom type Cycle, where I have overridden hashCode and equals as follows, in order to make sure the slow performance is not caused by the hashCode or equals methods. I have also set the initial capacity of the HashSet to 10,000,000.

@Override
public int hashCode() {
    final int prime = 31;
    int result = 1;
    // fold the high 32 bits of the long cycleId into the low 32 bits
    result = prime * result + (int) (cycleId ^ (cycleId >>> 32));
    return result;
}

@Override
public boolean equals(Object obj) {
    if (this == obj)
        return true;
    if (obj == null)
        return false;
    if (getClass() != obj.getClass())
        return false;
    Cycle other = (Cycle) obj;
    if (cycleId != other.cycleId)
        return false;
    return true;
}

After the first 1,500,000 values, when I try to add a new value (with the add method of the HashSet class), the program becomes very slow. Eventually I get a Java out-of-memory exception (Exception in thread "Thread-0" java.lang.OutOfMemoryError: Java heap space) before the stored values reach 1,600,000.

The IDE I use is Eclipse, so my next step was to increase the JVM heap size from the default value to 1 GB (using the flags -Xmx1000M and -Xms1000M). Now Eclipse starts with 10 times more memory available (I can see that at the bottom right, where the total heap size and used memory are shown), but again I get the same slow performance and the same out-of-memory error AT THE SAME VALUES as before (after 1,500,000 and before 1,600,000), which is very odd.

Does anyone have an idea what the problem might be?

Thank you in advance

C.LS
  • What is cycleId exactly? If it is an ID as in identity, and therefore unique for the cycles, then just return cycleId as the hashcode. If it is not an Integer, then take the hashCode of whatever type it is. If it is 64-bit and the IDs start from 0 (with an even distribution, or most in the lower 32 bits), then cast it to int. – Lasse Espeholt Jul 25 '10 at 12:24
  • @lasseespeholt, why? Then the hashcode would only depend on the lower 32 bits of the long! Using *all* the bits is the way to go. Imagine what kind of disaster would happen if String.hashCode() used only the last couple of characters to make a 32-bit hashCode! – Dimitris Andreou Jul 25 '10 at 12:43
  • Have you profiled your program to verify that it's the `HashSet` that is slowing things down? – Jack Leow Jul 25 '10 at 13:04
  • @dimitris I wrote explicitly that he should only do it with an even distribution of IDs. In that case it would be fine :) Or, if only e.g. 33 bits are actually used, then the hashcode could be optimized instead of using a standard 64-bit -> 32-bit hashcode. Your string analogy is not very good, I think. If we said we always had a 4-character ASCII string, then it would also be perfectly fine to use the lower 32 bits :) It all depends on what the distribution of IDs is. Sorry if my poor English does not explain me properly. – Lasse Espeholt Jul 25 '10 at 13:43
  • Actually, it was a bad example by me with the 33 bits. I see Java Long's hashCode() is: (int)(this.longValue()^(this.longValue()>>>32)), which will give a negative number with the 33rd bit, so it should be perfectly fine to use :) My point is that before using a standard hashCode() you should look at how the values are distributed. There does not exist a hashCode() which is good for all distributions if we compress more than 32 bits into 32 bits. So in some cases a custom one is needed. – Lasse Espeholt Jul 25 '10 at 14:02
  • "There does not exist a hashCode() which is good for all distributions if we compress more than 32 bits into 32 bits." Sure enough. But efficient problem solving dictates we start by spending our time on the most probable cause, rather than a remote theoretical possibility. – Dimitris Andreou Jul 25 '10 at 14:43
  • Agreed. All these comments turned out not to be the problem. Thanks for clearing that up and giving useful feedback :) – Lasse Espeholt Jul 25 '10 at 16:22

9 Answers


You don't want to increase the JVM heap for Eclipse, you want to set it for your program.

Go to Run > Run Configurations (or Debug Configurations) and set the VM arguments there.
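To confirm the setting actually reached your program rather than Eclipse itself, a minimal sketch you could drop into your main method (Runtime is standard JDK; the MB formatting is just for readability):

// Prints the heap limits the JVM actually started with.
// If -Xmx took effect, maxMemory() should report roughly the value you set.
Runtime rt = Runtime.getRuntime();
System.out.println("max heap:   " + rt.maxMemory() / (1024 * 1024) + " MB");
System.out.println("total heap: " + rt.totalMemory() / (1024 * 1024) + " MB");
System.out.println("free heap:  " + rt.freeMemory() / (1024 * 1024) + " MB");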

Devon_C_Miller

Not enough heap memory (increase it via -Xmx, e.g. -Xmx512m). When free memory gets very low, much, much time is spent by the garbage collector, which furiously scans the heap for unreachable objects.

Your hashCode() is fine, extra points for using all bits of the cycleId long.

Edit. Now I see you did increase the memory, and it didn't help. First of all, are you sure you actually managed to increase the memory? You could check this with jconsole: connect to your app and see its heap size.

As an alternative explanation to verify: is there any particular pattern in your cycleId that could make this hashCode() implementation bad? Like, its 32 high-order bits being mostly similar to the 32 low-order bits. (Yeah, right.)

But no. Even if that were the case, you would be seeing a gradual degradation of performance, not a sharp drop at a specific point (and you do get an OutOfMemoryError and frenzied GC activity). So my best guess is still a memory issue. You either didn't increase the heap size as you thought, or there is some other code grabbing memory at some point. (You could use a tool like VisualVM to profile this, get a heap dump upon OOME, and see what objects it contains.)
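As an aside, if you go the heap-dump route: HotSpot JVMs can take the dump for you at the moment the error is thrown, so you don't have to catch it manually. Add these VM arguments to the same launch configuration (the path is just an example):

-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/tmp/cycles-oome.hprof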

Edit 2: I made bold the correct part of the above.

Dimitris Andreou
  • Well.. the hashCode method was automatically generated by Eclipse... so extra points to Eclipse :) As for the Xmx, I have increased it, but strangely I have the same problem. The odd thing is that even though I have increased the heap memory, the program starts to get slow at the same amount of data as before.... – C.LS Jul 25 '10 at 12:53
  • (Also see the rest of the answers that suggest you didn't increase the heap of your app, but of eclipse itself.) – Dimitris Andreou Jul 25 '10 at 12:58
  • Why extra points to eclipse for generating the hashCode() method when it could have used Long.hashCode() ? – Lasse Espeholt Jul 25 '10 at 14:10
  • Well, the problem was two-sided. First I increased (as I should have) Eclipse's heap size. Then I had to go to the run configuration and do the same for my program. Second, the heap size usage shown at the bottom right is the heap that Eclipse uses, not my application's! Now I have made all the necessary corrections, and indeed I can see that my application is consuming all the memory (2GB of heap). Which makes me suicidal, because now I have to use another approach. How can I store this information on disk instead of in memory, but still be able to access specific info in constant time? Thanks! – C.LS Jul 25 '10 at 14:24
  • @lasseespeholt, two points. Long.hashCode() is exactly as the one eclipse generated. Also, to use Long.hashCode(), you have to instantiate a Long object, which is not needed if you already have the long primitive. – Dimitris Andreou Jul 25 '10 at 14:37
  • @Pitelk, glad you resolved your (first) problem (not an eclipse user here). The second question is definitely non-trivial, you should make another post for that (that is, if you can't find an easy way to reduce the memory footprint/cut some low hanging fruit or what not). – Dimitris Andreou Jul 25 '10 at 14:39
  • @Pitelk, you could also use a database, for a trivial solution. Of course random accesses on disk would be *much, much* slower (not even the cache would offer any benefit with these). – Dimitris Andreou Jul 25 '10 at 14:49

The memory available to an application you start from Eclipse should be configured from the Run menu. Try:

Run -> Run Configurations -> Arguments -> VM Arguments -> -Xmx1000M

The reason your program is slow is the garbage collector: it runs each time memory is about to hit the limit.
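If you want to see the collector thrashing directly, one option (assuming a HotSpot JVM) is to add a GC logging flag next to -Xmx in the same VM Arguments box:

-verbose:gc

With the heap nearly exhausted, you would expect to see back-to-back full collections that free almost nothing, right before the OutOfMemoryError appears.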

Vitalii Fedorenko
  • I have set the VM to 1G and it only uses less than 100M. The program behaves the same with a VM of 50M or a VM of 1G. Strangely enough, I get a memory-full error when the set has between 1,500,000 and 1,600,000 elements, independently of how much memory I have set. The heap size in Eclipse is shown at the bottom right corner, so I have double-checked that the Xmx and Xms commands work correctly. – C.LS Jul 25 '10 at 12:59
  • Thank you Vitalii. I checked with jconsole and it turned out that you are 100% right. The program uses all the memory, while Eclipse shows only 50MB out of the 1000MB. Also, Andreas is right: I have to declare explicitly in the launch configuration that I want the program to use all the heap space, independent of the fact that I started Eclipse with a 1000M heap. Thanks for your help – C.LS Jul 25 '10 at 13:32

Have you tested your hashCode method implementation? It always returns 31, for any value of cycleId. No wonder your HashSet works slowly: it has linear performance.

Roman
  • @Dimitris Andreou: I read it correctly. It **always** returns 31 because `(X ^ (X >>> 32))` always returns `0` for ints. – Roman Jul 25 '10 at 12:52
  • It's obviously a long. (Note the explicit cast to int). – Dimitris Andreou Jul 25 '10 at 13:00
  • @Dimitris Andreou: if we suppose that cycleId is a long (which is not established yet), then anyway I agree with lasseespeholt's comment about the hashCode implementation; it can contain `2 ^ 32` unique values, which is enough to map 16 GB of integers onto unique keys. – Roman Jul 25 '10 at 13:01
  • But if it's a long, and my answer is useless as well as yours, then I don't know how to help. – Roman Jul 25 '10 at 13:02
  • So your answer was addressing the astronomical improbability that someone would be shifting an int 32 positions, *and* then would type **four** parentheses to accommodate a cast that wasn't needed in the first place. Alright! (By the way, you meant 4GB integers). – Dimitris Andreou Jul 25 '10 at 13:06
  • Using all bits is also a recommendation of Effective Java, and also note that this was generated by eclipse. Perhaps you and lasseespeholt should be pointing this out to eclipse developers :) – Dimitris Andreou Jul 25 '10 at 13:07
  • @Dimitris: it's not related, but just curious: why 4GB? I calculated that 2 ^ 32 possible values = 2 ^ 32 mappings. Each int is 4 bytes, so the total weight is 2 ^ 32 * 4, which is 16 GB. – Roman Jul 25 '10 at 13:10
  • Oh, right, I thought you were counting the integers, not measuring their memory (but then, that's what the 'B' of 'GB' means :)). – Dimitris Andreou Jul 25 '10 at 13:13
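To make the dispute in these comments concrete, a quick sketch of how (x ^ (x >>> 32)) behaves for the two possible declarations of cycleId (the values are arbitrary):

// For an int, the shift distance is taken mod 32, so x >>> 32 == x,
// the XOR yields 0, and every hashCode would be 31.
// For a long, the shift really moves the high 32 bits down, mixing them in.
int i = 123456789;
System.out.println(i ^ (i >>> 32));         // 0: shifting an int by 32 is a no-op

long l = 123456789L | (42L << 32);          // a long with distinct high bits
System.out.println((int) (l ^ (l >>> 32))); // non-zero: high and low halves mixed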

If you want to increase the memory your program can use, it won't help to increase Eclipse's heap size. You must put the parameter into your program's launch configuration, under VM arguments.

Andreas
  • Thank you Andreas! You are right about that: I have to declare explicitly in the launch configuration that I want the program to use all the heap space, independent of the fact that I started Eclipse with a 1000M heap. Thanks for your help – C.LS Jul 25 '10 at 13:33

The JVM throws 'out of memory' NOT only based on available memory. It can also be thrown when too much time is being spent on garbage collection; check this. Exact implementation details vary based on the JVM and the garbage collector implementation.

Increasing memory would not help in this case. You may have to choose another approach.
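For reference, this behavior is HotSpot's "GC overhead limit": the error is thrown when more than about 98% of time goes to garbage collection while less than 2% of the heap is recovered. As an aside, the check can be switched off with a VM flag, though that usually just postpones the inevitable:

-XX:-UseGCOverheadLimit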

Gopi
  • Maybe you are right. But the thing is that the heap free space and used space are shown at the bottom right of Eclipse, so I can see when the garbage collector is running, as a lot of memory is freed when it does. Also, the memory used when the error pops up is 150M out of the 1000M. As far as I know, the garbage collector doesn't run until more memory is needed, which is not the case here. Thanks for your answer – C.LS Jul 25 '10 at 13:04
  • Turns out that Eclipse's heap is not the same as the heap my program uses. I have to set Eclipse's heap to 1G, but then I have to do the same in the run configuration. Also, the displayed heap size refers to the heap Eclipse uses, not my application's. So in the aftermath, my program does indeed use all the memory. – C.LS Jul 25 '10 at 13:56

I'm pretty disappointed at the number of answers telling the OP to increase his heap size in his application. That's not a solution--that's a quick-and-dirty patch, which won't address any underlying problem.

I found this presentation extremely informative: http://www.cs.virginia.edu/kim/publicity/pldi09tutorials/memory-efficient-java-tutorial.pdf

Mainly the page listing the minimal byte sizes of each collection when empty--

ArrayList: 40 or 48
LinkedList: 48
HashMap: 56 or 120
HashSet: 72 or 136

It turns out that a HashSet is practically a HashMap underneath, and (counterintuitively) takes up more memory despite holding only values instead of key-value pairs.
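A rough way to see the per-entry cost yourself, a sketch using Runtime (the numbers are approximate: GC timing, boxing, and object alignment all vary by JVM):

import java.util.HashSet;
import java.util.Set;

public class SetCost {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        System.gc();
        long before = rt.totalMemory() - rt.freeMemory();

        Set<Long> set = new HashSet<Long>(1000000);
        for (long i = 0; i < 500000; i++) {
            set.add(i); // each entry costs a map entry, a boxed Long, and table slack
        }

        System.gc();
        long after = rt.totalMemory() - rt.freeMemory();
        System.out.println("approx bytes per entry: " + (after - before) / set.size());
    }
}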

Alkanshel

Maybe your computer doesn't have enough memory, hence it has to swap to disk.

BobTurbo

How are you initializing your HashSet? You need to be aware of its growth pattern. With every add operation, it checks whether it is getting close to capacity. If it reaches a certain point (determined by its 'load factor'), it performs a resizing operation, which can be expensive. From the JavaDoc (of HashMap, the collection that backs HashSet):

As a general rule, the default load factor (.75) offers a good tradeoff between time and space costs. Higher values decrease the space overhead but increase the lookup cost (reflected in most of the operations of the HashMap class, including get and put). The expected number of entries in the map and its load factor should be taken into account when setting its initial capacity, so as to minimize the number of rehash operations. If the initial capacity is greater than the maximum number of entries divided by the load factor, no rehash operations will ever occur.
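Following that rule, a sketch of sizing the set so it never rehashes, using the asker's ten million expected entries, their Cycle type, and the default load factor quoted above:

// capacity must exceed expected / loadFactor for no rehash to ever occur
int expected = 10000000;
float loadFactor = 0.75f;
Set<Cycle> cycles = new HashSet<Cycle>((int) (expected / loadFactor) + 1, loadFactor);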

akf
  • The initial capacity is set to 10,000,000 and the program starts to be slow at 1,500,000, so this probably is not the case. Thanks for your answer – C.LS Jul 25 '10 at 12:54