
I read this page about the time complexity of Scala collections. As it says, Vector's complexity is eC for all operations.

It made me wonder what Vector is. I read the document and it says:

Because vectors strike a good balance between fast random selections and fast random functional updates, they are currently the default implementation of immutable indexed sequences. It is backed by a little endian bit-mapped vector trie with a branching factor of 32. Locality is very good, but not contiguous, which is good for very large sequences.

As with everything else about Scala, it's pretty vague. How actually does Vector work?

Luigi Plinge
Lai Yu-Hsuan
  • Continuing your statement *as everything else about Scala*, your question about Scala's Vector is very vague too. What exactly do you want to hear about Vector? The high-level description is pretty good; for low-level details it makes sense to check out the [sources](https://github.com/scala/scala/blob/v2.10.3/src/library/scala/collection/immutable/Vector.scala#L1). – om-nom-nom Dec 16 '13 at 14:10
  • Then again, if he's interested in knowing which kind of data structure is under the surface, the source won't help much (given the sparse comments) as you are basically expected to already know the algorithm and structure to make much sense of it. – Régis Jean-Gilles Dec 16 '13 at 14:17
  • Actually, that's pretty descriptive. There just aren't many ways to implement a bit-mapped vector trie with a branching factor of 32. Perhaps one could add that "mutating" an element creates copies of the arrays in direct line from the element to the root, but there really isn't much one can say on that short of source code (which is available, of course). – Daniel C. Sobral Dec 16 '13 at 16:01
  • The current documentation (2.13.6) says, "Vectors are implemented by radix-balanced finger trees of width 32." – W. Zhu Aug 01 '21 at 07:42

3 Answers


The keyword here is trie. Vector is implemented as a trie data structure. See http://en.wikipedia.org/wiki/Trie.

More precisely, it is a "bit-mapped vector trie". I've just found a concise enough description of the structure (along with an implementation - apparently in Rust) here:

https://bitbucket.org/astrieanna/bitmapped-vector-trie

The most relevant excerpt is:

A Bitmapped Vector Trie is basically a 32-tree. Level 1 is an array of size 32, of whatever data type. Level 2 is an array of 32 Level 1's. and so on, until: Level 7 is an array of 2 Level 6's.
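To make that concrete: each level of such a trie consumes 5 bits of the index (2^5 = 32 children per node), so a lookup just slices the index into 5-bit chunks, one per level. A small sketch of that slicing (illustrative names, not Vector's actual internals):

```scala
// Each trie level consumes 5 bits of the index (2^5 = 32 children).
// A 32-bit index therefore needs at most ceil(32 / 5) = 7 levels.
object IndexChunks {
  // Most significant chunk first: the chunk for a level selects
  // which child array slot to follow at that level.
  def chunks(index: Int, depth: Int): Seq[Int] =
    (depth - 1 to 0 by -1).map(level => (index >>> (5 * level)) & 0x1f)

  def main(args: Array[String]): Unit = {
    // index 1000 in a 2-level trie: 1000 = 31 * 32 + 8,
    // so follow child 31 at the top level, then slot 8 in the leaf
    println(chunks(1000, 2)) // Vector(31, 8)
  }
}
```

Following one chunk per level is why a lookup touches at most 7 arrays.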

UPDATE: In reply to Lai Yu-Hsuan's comment about complexity:

I will have to assume you meant "depth" here :-D. The legend for "eC" says "The operation takes effectively constant time, but this might depend on some assumptions such as maximum length of a vector or distribution of hash keys.".

If you are willing to consider the worst case, and given that there is an upper bound on the maximum size of the vector, then yes, we can say that the complexity is constant. Say we consider the maximum size to be 2^32; then the worst case is at most 7 operations, no matter what. Of course, we can always take the worst case for any type of collection, find an upper bound, and call that constant complexity, but for a list, for example, that constant would be 4 billion, which is not quite practical.

Vector is the opposite: 7 operations is more than practical, and this is why we can afford to consider its complexity constant in practice.

Another way to look at this: we are not talking about log(2,N) but log(32,N). If you try to plot that, you'll see it is practically a horizontal line. So pragmatically speaking, you'll never see much increase in processing time as the collection grows. Yes, it's still not truly constant (which is why it is marked as "eC" and not just "C"), and you'll be able to see a difference for short vectors (but again, a very small difference, because the number of operations grows so slowly).
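To put numbers on that near-horizontal line, here is a hedged sketch computing the trie depth, i.e. the smallest d with 32^d >= n (illustrative code, not Vector's internals):

```scala
object TrieDepth {
  // Depth of a 32-ary trie holding n elements:
  // the smallest d such that 32^d >= n.
  def depth(n: Long): Int = {
    require(n > 0, "n must be positive")
    var d = 1
    var capacity = 32L
    while (capacity < n) { capacity *= 32; d += 1 }
    d
  }

  def main(args: Array[String]): Unit =
    for (n <- Seq(32L, 1024L, 1000000L, 1000000000L, 1L << 32))
      println(s"$n elements -> depth ${depth(n)}")
}
```

A million elements need only 4 levels, a billion need 6, and even 2^32 elements top out at 7, which is the "at most 7 operations" figure above.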

Régis Jean-Gilles
  • If it's just a trie, the death is `log(32,N)` so the complexity should be `O(logN)` instead of `eC`, isn't it? – Lai Yu-Hsuan Dec 16 '13 at 14:41
  • @LaiYu-Hsuan the meaning of 'effectively constant time' is described in new [docs](http://docs.scala-lang.org/overviews/collections/concrete-immutable-collection-classes.html#vectors) – 4e6 Dec 16 '13 at 14:51
  • @RégisJean-Gilles (AFAICT there's no way that StackOverflow supports directing comments to accented names.) Would you clarify `AFAII this means that given that we have` ...? – Ed Staub Dec 16 '13 at 15:28
  • Sorry, that was just a left-over from an edit. I removed this sentence. – Régis Jean-Gilles Dec 16 '13 at 15:45

The other answers re 'Trie' are good. But as a close approximation, just for quick understanding:

  • Vector internally uses a tree structure - not a binary tree, but a 32-ary tree
  • Each '32-way node' uses Array[32] and can store either 0-32 references to child nodes or 0-32 pieces of data
  • The tree is structured to be balanced in a certain way: it is "n" levels deep, but levels 1 to n-1 are "index-only levels" (100% child references; no data) and level n contains all the data (100% data; no child references). So if the number of data elements is "d", then n = log-base-32(d) rounded upwards

Why this? Simple: for performance.

Instead of doing thousands/millions/gazillions of memory allocations for each individual data element, memory is allocated in 32-element chunks. Instead of walking miles deep to find your data, the structure is quite shallow: it's a very wide, short tree. E.g. a tree 5 levels deep can contain 32^5 (about 33 million) data elements - for 4-byte elements, roughly 128 MB, i.e. pretty big - and each data access looks up and walks through 5 nodes from the root (whereas a big array would use a single data access). The vector does not proactively allocate memory for all of Level n (data); it allocates 32-element chunks as needed. It gives read performance somewhat similar to a huge array, whilst having functional characteristics (power & flexibility & memory-efficiency) somewhat similar to a binary tree.
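Those functional characteristics come from path copying: a persistent update clones only the handful of 32-slot arrays on the path from the root to the changed element, sharing everything else with the old version. A simplified sketch under the assumptions above (names are illustrative; a real Vector also tracks size, display pointers, etc.):

```scala
// Simplified persistent 32-ary trie: leaves hold data, branches hold
// children. An update copies only the arrays along one root-to-leaf path.
sealed trait Node[A]
final case class Leaf[A](elems: Array[Any]) extends Node[A]
final case class Branch[A](children: Array[Node[A]]) extends Node[A]

object Trie {
  // Lookup: peel off 5 index bits per level, leaf is depth 1.
  def apply[A](node: Node[A], index: Int, depth: Int): A = node match {
    case Leaf(elems) => elems(index & 0x1f).asInstanceOf[A]
    case Branch(children) =>
      apply(children((index >>> (5 * (depth - 1))) & 0x1f), index, depth - 1)
  }

  // Persistent update: clone one 32-slot array per level on the path.
  def updated[A](node: Node[A], index: Int, depth: Int, value: A): Node[A] =
    node match {
      case Leaf(elems) =>
        val copy = elems.clone()
        copy(index & 0x1f) = value
        Leaf(copy)
      case Branch(children) =>
        val slot = (index >>> (5 * (depth - 1))) & 0x1f
        val copy = children.clone()
        copy(slot) = updated(children(slot), index, depth - 1, value)
        Branch(copy)
    }

  def main(args: Array[String]): Unit = {
    val leaf0 = Leaf[Int](Array.tabulate[Any](32)(identity))
    val leaf1 = Leaf[Int](Array.tabulate[Any](32)(i => 32 + i))
    val root  = Branch[Int](Array[Node[Int]](leaf0, leaf1))
    val root2 = updated(root, 33, 2, -1)
    println(Trie(root, 33, 2))  // 33: the old version is untouched
    println(Trie(root2, 33, 2)) // -1
    // Only the touched path was copied; the other leaf is shared:
    println(root2.asInstanceOf[Branch[Int]].children(0) eq leaf0) // true
  }
}
```

Updating one element in a 5-level trie copies 5 small arrays instead of the whole structure, which is what makes cheap "functional updates" possible.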

:)

Glen Best

These may be interesting for you:

sourcedelica