Performance of immutable set implementations in Scala

Question

I have recently been diving into Scala and (perhaps predictably) have spent quite a bit of time studying the immutable collections API in the Scala standard library.

I am writing an application that necessarily does many +/- operations on large sets. For this reason, I want to ensure that the implementation I choose is a so-called "persistent" data structure so that I avoid doing copy-on-write. I saw this answer by Martin Odersky, but it didn't really clear up the issue for me.

I wrote the following test code to compare the performance of ListSet and HashSet for add operations:

import scala.collection.immutable._

object TestListSet extends App {
  var set = new ListSet[Int]
  for(i <- 0 to 100000) {
    set += i
  }
}

object TestHashSet extends App {
  var set = new HashSet[Int]
  for(i <- 0 to 100000) {
    set += i
  }
}

Here is a rough runtime measurement of the HashSet:

$ time scala TestHashSet

real    0m0.955s
user    0m1.192s
sys     0m0.147s

And ListSet:

$ time scala TestListSet

real    0m30.516s
user    0m30.612s
sys     0m0.168s

Cons on a singly linked list is a constant-time operation, but this performance looks linear or worse. Is this performance hit related to the need to check each element of the set for object equality to conform to the no-duplicates invariant of Set? If this is the case, I realize it's not related to "persistence".

As for official documentation, all I could find was the following page, but it seems incomplete: Scala 2.8 Collections API -- Performance Characteristics. Since ListSet seems initially to be a good choice for its memory footprint, maybe there should be some information about its performance in the API docs.

score 10 · Answer 1 · answered Jun 18 '13 at 20:20

An old question but also a good example of conclusions being drawn on the wrong foundation.

Connor, basically you're trying to do a microbenchmark. That is generally not recommended and damn hard to do properly.

Why? Because the JVM is doing many other things than executing the code in your examples. It's loading classes, doing garbage collection, compiling bytecode to native code, etc. All dynamically and based on different metrics sampled at runtime.

So you cannot conclude anything about the performance of the two collections with the above test code. For example, what you could actually be measuring could be the compilation time of the += method of HashSet and garbage collection times of ListSet. So it's a comparison between apples and pears.

To do a micro benchmark properly, you should:

Warm up the JVM: Load all classes, ensure all code paths in the benchmark are run and hot spots in the code are compiled (e.g. the += method).
Run the benchmark and ensure neither the GC or the compiler runs during the test (use the JVM flags -XX:-PrintCompilation and -XX:-PrintGC). If either runs during the test, discard the result.
Repeat step 2 and sample 10-15 good measurements. Calculate variance and standard deviation.
Evaluate: If the mean of each benchmark +/- 3 std do not overlap, then you can draw a conclusion about which is faster. Otherwise, it's a blurry result (depending on the amount of overlap).

I can recommend reading Oracle's recommendations for doing micro benchmarks and a great article about benchmark pitfalls by Brian Goetz.

Also, if you want to use a good tool, which does all the above for you, try Caliper by Google.

score 9 · Accepted Answer · answered Aug 04 '11 at 20:17

9

The key line from the ListSet source is (within subclass Node):

override def +(e: A): ListSet[A] = if (contains(e)) this else new Node(e)

where you can see that an item is added only if it is not already contained. So adding to the set is O(n). You can generally assume that XMap has similar performance characteristics to XSet, and ListMap is listed as linear time all the way along. This is why, and it is how a set is supposed to behave.

P.S. In the TestHashSet case you're measuring startup time. It's way more than 30x faster.

answered Aug 04 '11 at 20:17

Rex Kerr

166,841
26
322
407

This makes sense, that's what I was attributing the extra `0(n)` to. There are a lot of articles on the net touting "persistence" as a panacea but we can't forget basic complexity. Thanks Rex! – Connor Doyle Aug 04 '11 at 20:26
1

scala.collection.immutable.HashSet is the default implementation when you create an immutable set, and here you see why. I wonder in which case one would explicitly use ListSet? – ebruchez Aug 04 '11 at 20:49
I'm thinking it would be useful in very memory-scarce environments. In that case however, one probably wouldn't be targeting the JVM. – Connor Doyle Aug 04 '11 at 22:01
By the way Rex, you were right about the speed difference. I threw some more precise timing code around the loops and it looks like `HashSet` was about 130x faster than `ListSet` over the 100k iterations on my machine. – Connor Doyle Aug 04 '11 at 22:23

score 6 · Answer 3 · answered Aug 04 '11 at 20:19

6

Since a set has to have with no duplicates, before adding an element, a Set must check to see if it already contains the element. This search in a list that has no guarantee of an element's position will be O(N) linear time. The same general idea applies to its remove operation.

With a HashSet, the class defines a function that picks a location for any element in O(1), which makes the contains(element) method much quicker, at the expense of taking up more space to decrease the chance of element location collisions.

answered Aug 04 '11 at 20:19

Dylan

13,645
3
40
67

1

Thanks Dylan, thinking about the implementation, this makes sense to me. I think my confusion was related to the documentation. I just hadn't quite thought through "this set is backed by a List, and sets can't contain duplicates, so adds are going to be slow". I thought instead: "cons is fast, and a list is going to use less memory than a hash table". My mistake, but could have been avoided with a note to that effect in the scaladoc for the `ListSet` class. – Connor Doyle Aug 04 '11 at 20:54

Performance of immutable set implementations in Scala

3 Answers3