I have recently been diving into Scala and (perhaps predictably) have spent quite a bit of time studying the immutable collections API in the Scala standard library.
I am writing an application that necessarily does many +/- operations on large sets. For this reason, I want to ensure that the implementation I choose is a so-called "persistent" data structure so that I avoid doing copy-on-write. I saw this answer by Martin Odersky, but it didn't really clear up the issue for me.
I wrote the following test code to compare the performance of ListSet and HashSet for add operations:
import scala.collection.immutable._
object TestListSet extends App {
var set = new ListSet[Int]
for(i <- 0 to 100000) {
set += i
}
}
object TestHashSet extends App {
var set = new HashSet[Int]
for(i <- 0 to 100000) {
set += i
}
}
Here is a rough runtime measurement of the HashSet:
$ time scala TestHashSet
real 0m0.955s
user 0m1.192s
sys 0m0.147s
And ListSet:
$ time scala TestListSet
real 0m30.516s
user 0m30.612s
sys 0m0.168s
Cons on a singly linked list is a constant-time operation, but this performance looks linear or worse. Is this performance hit related to the need to check each element of the set for object equality to conform to the no-duplicates invariant of Set? If this is the case, I realize it's not related to "persistence".
As for official documentation, all I could find was the following page, but it seems incomplete: Scala 2.8 Collections API -- Performance Characteristics. Since ListSet seems initially to be a good choice for its memory footprint, maybe there should be some information about its performance in the API docs.