
Ok, here's my situation:

I have an Array of States, which may contain duplicates. To get rid of the duplicates, I can add them all to a Set.

However, when I create the Set, it wants the initial capacity and load factor to be defined, but what should they be set to?

From googling, I have come up with:

String[] allStates = getAllStates();
Set<String> uniqueStates = new HashSet<String>(allStates.length, 0.75f);

The problem with this is that allStates can contain anywhere between 1 and 5000 states. So the Set could have a capacity of over 5000, while only ever containing at most 50.

So alternatively, the initial capacity of the Set could be set to the maximum number of states, and the load factor to 1.
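
In code, the alternative I mean would be something like this (hard-coding 50 as the known maximum just to illustrate):

int maxStates = 50; // the most unique states the array can ever produce
Set<String> uniqueStates = new HashSet<String>(maxStates, 1.0f);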

I guess my questions really are:

  • What should you set the initial capacity to be when you don't know how many items are to be in the Set?
  • Does it really matter what it gets set to when the most it could contain is 50?
  • Should I even be worrying about it?
Clarkey

7 Answers


Assuming that you know there won't be more than 50 states (do you mean US States?), the

Set<String> uniqueStates = new HashSet<String>(allStates.length, 0.75f);

quoted is definitely wrong. I'd suggest you go for an initial capacity of 50 / 0.75 = 67, or perhaps 68 to be on the safe side.
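
So if you really want to pre-size it, a sketch would be (68 being 50 / 0.75 rounded up, with a little slack):

// Sized so that 50 entries fit without the backing table ever resizing.
Set<String> uniqueStates = new HashSet<String>(68, 0.75f);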

I also feel the need to point out that you're probably overthinking this intensely. Resizing the set's backing hash table a couple of times, from 16 up to 64, isn't going to give you a noticeable performance hit unless this is right in the most performance-critical part of the program.

So the best answer is probably to use:

new HashSet<String>();

That way, you won't come back a year later and puzzle over why you chose such strange constructor arguments.
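
For completeness, a sketch of the whole de-duplication with the default constructor (reusing getAllStates() from the question and Arrays.asList from java.util):

String[] allStates = getAllStates();
Set<String> uniqueStates = new HashSet<String>(); // default capacity and load factor
uniqueStates.addAll(Arrays.asList(allStates));    // duplicates are silently dropped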

Zarkonnen
  • +1 for the "come back a year later and puzzle". Always happens to undocumented code – Rishi Dua Jul 01 '14 at 11:44
  • But I think in practice this value will be rounded up to the nearest power of two (in this case 128), which is way too much for our needs – Mr.Q Apr 03 '17 at 07:55

Use the constructor where you don't need to specify these values; reasonable defaults are then chosen.

starblue

First, I'm going to say that in your case you're definitely overthinking it. However, there are probably situations where one would want to get it right. So here's what I understand:

1) The number of items you can hold in your HashSet = initial capacity x load factor. So if you want to be able to hold n items, you need to do what Zarkonnen did and divide n by the load factor.

2) Under the covers, the initial capacity is rounded up to a power of two, per the Oracle documentation.

3) Load factor should be no more than .80 to prevent excessive collisions, as noted by Tom Hawtin - tackline.

If you just accept the default values (initial capacity = 16, load factor = .75), you'll end up doubling your set in size 3 times. (Initial max size = 12, first increase makes capacity 32 and max size 24 (32 * .75), second increase makes capacity 64 and max size 48 (64 * .75), third increase makes capacity 128 and max size 96 (128 * .75).)

To get your max size closer to 50, but keep the set as small as possible, consider an initial capacity of 64 (a power of two) and a load factor of .79 or more. 64 * .79 = 50.56, so you can get all 50 states in there. Specifying 32 < initial capacity < 64 will result in initial capacity being rounded up to 64, so that's the same as specifying 64 up front. Specifying initial capacity <= 32 will result in a size increase. Using a load factor < .79 will also result in a size increase unless your initial capacity > 64.

So my recommendation is to specify initial capacity = 64 and load factor = .79.
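
In code, that recommendation is roughly (a sketch, reusing getAllStates() from the question):

Set<String> uniqueStates = new HashSet<String>(64, 0.79f);
for (String state : getAllStates()) {
    uniqueStates.add(state);
}
// Threshold is (int) (64 * .79) = 50, so all 50 states should fit without a resize.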

Erick G. Hagstrom

The safe bet is to go for a size that is too small.

Because resizing is ameliorated by an exponential growth algorithm (see the Stack Overflow podcast from a few weeks back), going small will never cost you that much. If you have lots of sets (lucky you), then it will matter to performance if they are oversized.

Load factor is a tricky one. I suggest leaving it at the default. As I understand it: below about 0.70f you are making the array too large and therefore slower; above 0.80f you'll start getting too many key clashes. Presumably probing algorithms will require lower load factors than bucket algorithms.

Also note that the "initial capacity" means something slightly different from what most people appear to think: it refers to the number of slots in the backing array, not the number of elements the set can hold. To get the capacity for a given number of elements, divide by the desired load factor (and round appropriately).
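
A sketch of that calculation (capacityFor is a made-up name, not a library method):

// Hypothetical helper: the backing-array size needed so that
// 'expected' elements fit at the given load factor.
static int capacityFor(int expected, float loadFactor) {
    return (int) Math.ceil(expected / loadFactor);
}

// e.g. capacityFor(50, 0.75f) == 67
Set<String> uniqueStates = new HashSet<String>(capacityFor(50, 0.75f), 0.75f);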

Tom Hawtin - tackline

If you were to optimize this -- and it may be appropriate to do that -- your decision will depend partly on how many duplicates you expect the array to have.

  • If there are very many duplicates, you will want a smaller initial capacity. Large, sparse hash tables are bad when iterating.

  • If there are not expected to be very many duplicates, you will want an initial capacity such that the entire array could fit without resizing.

My guess is that you want the latter, but this is something worth considering if you pursue this.
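
A rough sketch of the two cases (expectMostlyDuplicates is a hypothetical flag standing in for whatever you know about your data):

String[] allStates = getAllStates();
Set<String> uniqueStates;
if (expectMostlyDuplicates) {
    // Many duplicates: keep the table small and let it grow if it must.
    uniqueStates = new HashSet<String>(16);
} else {
    // Few duplicates: size it so the whole array fits without resizing.
    uniqueStates = new HashSet<String>((int) (allStates.length / 0.75f) + 1);
}
uniqueStates.addAll(Arrays.asList(allStates));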

Paul Draper

Make a good guess. There is no hard rule. If you know there are likely to be, say, 10-20 states, I'd start off with that number (20).

tehvan

I second Zarkonnen. Your last question is the most important one. If this happens to occur in a hot spot of your application, it might be worth the effort to look at it and try to optimise; otherwise, CPU cycles are cheaper than burning up your own neurons.

Jeroen van Bergen