
I've discovered that Haskell's Data.Vector.* modules lack an equivalent of C++'s std::vector::push_back. There are grow/unsafeGrow, but they appear to have O(n) complexity.

Is there a way to grow vectors in O(1) amortized time for an element?

Aleksandr Pakhomov
  • Maybe check out `Data.Sequence`? It has `O(1)` amortized for most things. What exactly do you need other than `push_back`? – Alec Jul 23 '15 at 21:34
  • Also, why can't you define your own `push_back` by growing the array to double its size every time it overflows? – Alec Jul 23 '15 at 21:37
  • It's good, but I use unboxed vectors for performance reasons. I need contiguous storage. Also I want random access for binary search. – Aleksandr Pakhomov Jul 23 '15 at 21:40
  • Alec, I can create a wrapper around Vector that grows exponentially, but I would be surprised if Haskell lacked such a commonly useful Vector operation. – Aleksandr Pakhomov Jul 23 '15 at 21:42
  • `vector` operations are subject to fusion and won't perform the way you assume coming from C++ in any case. You should benchmark. – jberryman Jul 23 '15 at 23:04
  • std::vector::push_back is `O(n)` in general; it's only `O(1)` when it doesn't exceed the allocated capacity. – Jeremy List Jul 24 '15 at 05:56
  • @JeremyList Yes, but calling push_back `n` times is also `O(n)`, not `O(n^2)`, due to exponential growth. Hence the _amortized_ complexity of one push_back is `O(1)`. – chi Jul 24 '15 at 07:59
  • You can only get amortized constant-time array doubling in a single-threaded context (e.g., with a *mutable* Haskell vector in `ST` or `IO`). In a persistent (pure) setting, I can make a vector one smaller than the threshold and add an element to that over and over, incurring the linear cost without having paid for it. See Okasaki's thesis or book for how to deal with amortization in a persistent setting. – dfeuer Nov 19 '16 at 18:28

1 Answer


No, there really is no such facility in Data.Vector. It isn't too difficult to implement this from scratch using MutableArray, as Data.Vector.Mutable does (see my implementation below), but there are some significant drawbacks. In particular, all of its operations end up happening inside a state context, usually ST or IO. This has the downsides that

  1. Any code that manipulates such a data structure ends up being monadic
  2. The compiler is much less likely to be able to optimize. For example, libraries like vector use a clever technique called fusion to optimize away intermediate allocations; that sort of thing is not possible in a state context.
  3. Parallelism is going to be a lot tougher: in ST you can't have two threads at all, and in IO you will have race conditions all over the place. The nasty bit is that any sharing has to happen in IO.

As if all this weren't enough, garbage collection also performs better inside pure code.

What do I do then?

It isn't particularly often that you need exactly this behaviour - usually you are better off with an immutable data structure (thereby avoiding all of the aforementioned problems) that does something similar. Limiting ourselves to containers that ship with GHC, some alternatives include:

  • if you almost always just use push_back, maybe you just want a stack (a plain old [a]).
  • if you anticipate doing more push_backs than lookups, Data.Sequence gives you O(1) appending to either end and O(log n) lookup.
  • if you need a wider range of operations, especially map-like ones, Data.IntMap is well optimized. Even though the theoretical cost of those operations is O(log n), you need a pretty big IntMap before you start feeling that cost.

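To make the Data.Sequence option concrete, here is a small sketch (`pushBack` is my own illustrative name, not library API; `(|>)` and `Seq.lookup` are real containers functions):

```haskell
import qualified Data.Sequence as Seq
import Data.Sequence (Seq, (|>))
import Data.Foldable (toList)

-- Illustrative helper: a push_back analogue on Seq.
-- (|>) appends at the right end in amortized O(1).
pushBack :: Seq a -> a -> Seq a
pushBack s x = s |> x

main :: IO ()
main = do
  let s = foldl pushBack Seq.empty [1 .. 5 :: Int]
  print (Seq.length s)   -- 5
  print (Seq.lookup 2 s) -- Just 3 (random access is O(log n))
  print (toList s)       -- [1,2,3,4,5]
```

The sequence stays pure and persistent, so no monadic plumbing is needed - the trade-off is O(log n) indexing instead of the O(1) you get from contiguous storage.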
Making something like C++ vector

Of course, if one doesn't care about the restrictions mentioned initially, there is no reason not to have a C++-like vector. Just for fun, I went ahead and implemented it from scratch (it needs the data-default and primitive packages).

The reason this code probably isn't already in some library is that it goes against much of the spirit of Haskell (here I am deliberately conforming to a C++-style vector).

  • The only operation that actually makes a new vector is newEmpty - everything else "modifies" an existing vector. Since pushBack doesn't return a new GrowVector, it has to modify the existing one (including its length and/or capacity), so length and capacity have to be "pointers". In turn, that means that even getting the length is a monadic operation.
  • While this isn't unboxed, it would not be too difficult to replicate vector's data family approach - it is just tedious¹.

With that said:

module GrowVector (
  GrowVector, newEmpty, size, read, write, pushBack, popBack
) where 

import Data.Primitive.Array
import Data.Primitive.MutVar
import Data.Default
import Control.Monad
import Control.Monad.Primitive (PrimState, PrimMonad)
import Prelude hiding (length, read)

data GrowVector s a = GrowVector
  { underlying :: MutVar s (MutableArray s a) -- ^ underlying array
  , length :: MutVar s Int                    -- ^ perceived length of vector
  , capacity :: MutVar s Int                  -- ^ actual capacity
  }

type GrowVectorIO = GrowVector (PrimState IO)

-- | Make a new empty vector with the given capacity. O(n)
newEmpty :: (Default a, PrimMonad m) => Int -> m (GrowVector (PrimState m) a)
newEmpty cap = do
  arr <- newArray cap def
  GrowVector <$> newMutVar arr <*> newMutVar 0 <*> newMutVar cap

-- | Read an element in the vector (unchecked). O(1)
read :: PrimMonad m => GrowVector (PrimState m) a -> Int -> m a
g `read` i = do arr <- readMutVar (underlying g); arr `readArray` i

-- | Find the size of the vector. O(1)
size :: PrimMonad m => GrowVector (PrimState m) a -> m Int
size g = readMutVar (length g)

-- | Double the vector capacity. O(n)
resize :: (Default a, PrimMonad m) => GrowVector (PrimState m) a -> m ()
resize g = do
  curCap <- readMutVar (capacity g)         -- read current capacity
  curArr <- readMutVar (underlying g)       -- read current array
  curLen <- readMutVar (length g)           -- read current length
  let newCap = max 1 (2 * curCap)           -- double it (guarding against 0)
  newArr <- newArray newCap def             -- allocate a new, bigger array
  copyMutableArray newArr 0 curArr 0 curLen -- copy the old elements over
  underlying g `writeMutVar` newArr         -- use the new array in the vector
  capacity g `writeMutVar` newCap           -- update the capacity in the vector

-- | Write an element to the array (unchecked). O(1)
write :: PrimMonad m => GrowVector (PrimState m) a -> Int -> a  -> m ()
write g i x = do arr <- readMutVar (underlying g); writeArray arr i x

-- | Pop an element off the back of the vector, mutating it (unchecked). O(1)
popBack :: PrimMonad m => GrowVector (PrimState m) a -> m a
popBack g = do
  s <- size g
  x <- g `read` (s - 1)
  length g `modifyMutVar'` (subtract 1)
  pure x

-- | Push an element onto the back. Amortized O(1)
pushBack :: (Default a, PrimMonad m) => GrowVector (PrimState m) a -> a -> m ()
pushBack g x = do
  s <- readMutVar (length g)                -- read current size
  c <- readMutVar (capacity g)              -- read current capacity
  when (s == c) (resize g)                  -- if the array is full, resize
  write g s x                               -- write to the back of the array
  length g `modifyMutVar'` (+1)             -- increase the length

Current semantics of grow

I think the github issue does a pretty good job of explaining the semantics:

I think the intended semantics are that it may do a realloc, but not guaranteed to, and all the current implementations do the simpler copying semantics because for on heap allocations the cost should be roughly the same.

Basically you should use grow when you want a new mutable vector of an increased size, starting with the elements of the old vector (and no longer care about the old vector). This is quite useful - for example one could implement GrowVector using MVector and grow.
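As a sketch of that last point (assuming the vector package; `pushMV` and the doubling policy are my own illustrative choices, not library API), a push_back built on MVector and unsafeGrow could look like:

```haskell
import Control.Monad (foldM)
import Control.Monad.ST (ST, runST)
import qualified Data.Vector as V
import qualified Data.Vector.Mutable as MV

-- Illustrative only: pair the mutable vector with its logical length,
-- calling unsafeGrow (doubling the capacity) only when the buffer is full.
pushMV :: (MV.MVector s a, Int) -> a -> ST s (MV.MVector s a, Int)
pushMV (v, len) x
  | len < MV.length v = do MV.write v len x; pure (v, len + 1)
  | otherwise = do
      v' <- MV.unsafeGrow v (max 1 (MV.length v)) -- grow by the current size
      MV.write v' len x
      pure (v', len + 1)

demo :: V.Vector Int
demo = runST $ do
  v0 <- MV.new 1
  (v, len) <- foldM pushMV (v0, 0) [1 .. 6]
  V.freeze (MV.take len v)   -- fromList [1,2,3,4,5,6]
```

Each grow allocates a fresh buffer and copies the elements, which is why a single grow is O(n) but a push is O(1) amortized under doubling.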


¹ The approach is that for every new type of unboxed vector you want, you write a data instance that "expands" your type into a fixed number of unboxed arrays (or other unboxed vectors). This is the point of data family - to allow different instantiations of a type to have totally different runtime representations, and to be extensible (you can add your own data instance if you want).

Alec