9

Curiosity and efficiency are the reasons for this question. I am in a situation where I am creating many new HashSets after certain loops run.

The HashSet is currently declared as such at the top of the class:

private Set<String> failedTests;

Then later in the code, I just create a new failedTests HashSet whenever I am re-running the tests:

failedTests = new HashSet<String>(16384);

I do this over and over, depending on the size of the test, and I expect the garbage collector to handle the old data efficiently. But I know another option would be to create the HashSet once at the beginning:

private Set<String> failedTests = new HashSet<String>(16384);

and then clear the HashSet each time through the loop.

failedTests.clear();

My question is: which is the most efficient way of doing this in terms of overhead, etc.? I don't know what the clear() function is doing inside -- is it doing the same thing, sending the old data to garbage collection, or is it doing something even more efficient? Also, I am giving the HashSet a large cushion of initial capacity, but if a test requires more than 2^14 elements, will the .clear() call reset the HashSet back to its initial capacity of 16384?

To add, I found the source code of clear() here, so it is at least an O(n) operation in the worst case.

Using the clear() function, a test process finished in 565 seconds. Letting the GC handle the old sets, the same test finished in 506 seconds.

But it's not a perfect benchmark, because there are other external factors such as interaction with the local and network file systems. Still, a full minute does feel like a significant difference. Can anyone recommend a specific profiling tool that works at the line/method level? (I am using Eclipse Indigo.)
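
For reference, a rough micro-benchmark that isolates just the two strategies (no file or network I/O) might look like the sketch below. The element count and run count are arbitrary assumptions, and a proper measurement would need a benchmarking harness (e.g. Caliper or JMH) to account for JIT warm-up:

import java.util.HashSet;
import java.util.Set;

public class ClearVsNewBenchmark {
    private static final int CAPACITY = 16384;
    private static final int ELEMENTS = 10000;  // arbitrary fill size per run
    private static final int RUNS = 1000;       // arbitrary number of repetitions

    public static void main(String[] args) {
        System.out.println("clear():       " + timeClear() + " ms");
        System.out.println("new HashSet(): " + timeNew() + " ms");
    }

    // Strategy 1: reuse one set and clear() it between runs.
    private static long timeClear() {
        Set<String> set = new HashSet<String>(CAPACITY);
        long start = System.nanoTime();
        for (int run = 0; run < RUNS; run++) {
            fill(set);
            set.clear();
        }
        return (System.nanoTime() - start) / 1000000;
    }

    // Strategy 2: allocate a fresh set each run and let the GC collect the old one.
    private static long timeNew() {
        long start = System.nanoTime();
        for (int run = 0; run < RUNS; run++) {
            Set<String> set = new HashSet<String>(CAPACITY);
            fill(set);
        }
        return (System.nanoTime() - start) / 1000000;
    }

    private static void fill(Set<String> set) {
        for (int i = 0; i < ELEMENTS; i++) {
            set.add("test-" + i);
        }
    }
}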

E.S.
  • Have you tried benchmarking it? – rob Jun 17 '13 at 20:02
  • Do you have any measure as to how *many* new sets you are creating? Did you actually test your application's behaviour? It's a case of the *memory vs performance* question that often leads to premature optimization. As a baseline you can create a new `HashSet`, allow GC to do its job and do a little profiling to see the real times before worrying. After all, the `clear` method involves an iteration, nulling references and letting the GC do its job anyway. – Fritz Jun 17 '13 at 20:04
  • possible duplicate of [Fastest way to recreate the ArrayList in a for loop](http://stackoverflow.com/questions/11740013/fastest-way-to-recreate-the-arraylist-in-a-for-loop): `new` is generally faster than `clear`. – assylias Jun 17 '13 at 20:17
  • The HashSet is used throughout the testing and I notice very little difference in timing, but do you recommend a specific profiling solution that will time each line of code/method call? I will publish the profiling in the end anyway, but there are many other factors in my program that can cause delays, such as file reads... – E.S. Jun 17 '13 at 22:58

2 Answers

6

I don't know what the clear() function is doing inside

It calls the clear() method of the HashMap that it uses internally as its backing table. Within HashMap, the clear() method is defined as follows:

public void clear() {
    modCount++;                      // record a structural modification for fail-fast iterators
    Entry[] tab = table;
    for (int i = 0; i < tab.length; i++)
        tab[i] = null;               // null out each bucket so its entries become GC-eligible
    size = 0;                        // the set is now empty, but the table keeps its length
}

is it doing the same thing, sending the old data to garbage collection, or is it doing something even more efficient?

The line tab[i] = null shows that it is simply making the old data eligible for garbage collection.

Also, I am giving the HashSet a large cushion of initial capacity, but if a test requires more than 2^14 elements, will the .clear() call reset the HashSet back to its initial capacity of 16384?

No, it won't. The backing table keeps whatever larger capacity it has grown to.

which is the most efficient way of doing this in terms of overhead, etc?

I would trust the Java garbage collector to do its work in the most efficient way, so I would let it take care of the old data. In other words, I would prefer creating a new failedTests HashSet each time it is needed.
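
A minimal sketch of that approach, with the class name and test-running helper made up purely for illustration:

import java.util.HashSet;
import java.util.Set;

public class TestRunner {  // hypothetical class name

    private Set<String> failedTests;

    public void runAllTests(Iterable<String> tests) {
        // Fresh set for every run; the previous one becomes garbage for the GC.
        failedTests = new HashSet<String>(16384);
        for (String test : tests) {
            if (!runSingleTest(test)) {  // hypothetical helper
                failedTests.add(test);
            }
        }
    }

    private boolean runSingleTest(String test) {
        // Placeholder for the real test execution.
        return true;
    }
}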

Vishal K
    Large objects go straight into tenured space, so it's more expensive to GC them than it is to GC smaller objects in the nursery generation. Still, this cost pales in comparison to the cost of iterating through all 16000 elements of the backing array. – Zim-Zam O'Pootertoot Jun 17 '13 at 20:13
5

Recreating the HashSet is more efficient.

1) If the HashSet's capacity has grown above 16384, clear() will not reset it to the initial capacity.

2) new HashSet(16384) creates a new Entry[16384] array in one operation, which is more efficient than nulling the elements one by one, as clear() does:

for (int i = 0; i < tab.length; i++)
    tab[i] = null;
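
As a rough analogy (not the actual HashMap internals, just an illustration of the difference in work), compare a single array allocation with a loop that nulls every slot:

public class TableResetComparison {
    public static void main(String[] args) {
        // New backing table: the JVM hands back an already-zeroed array in one bulk allocation.
        Object[] freshTable = new Object[16384];

        // clear()-style reset: a loop writes null into every one of the 16384 slots.
        Object[] oldTable = new Object[16384];
        for (int i = 0; i < oldTable.length; i++) {
            oldTable[i] = null;
        }
    }
}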
Evgeniy Dorofeev