
Apologies for what seems like a basic question. I'm trying to figure out how to convert a Data.Text into a vector of Char. The best I could come up with is this:

textToVec :: T.Text -> V.Vector Char
textToVec t = T.foldr append V.empty t where
   append c vec = V.snoc vec c

This seems somewhat complicated (and computationally inefficient?). Is there a better way?

More generally, as a Haskell newbie, I'd appreciate any advice on how to go about figuring out this sort of thing for myself. I wasn't able to make much progress from either Google or searching through the documentation.

happydave
    I don't know of any pre-existing functions that would do this for you. For what it's worth, if you need it to be efficient I'd probably approach it much as you would in C: preallocate a `Vector` of the correct length, `traverse` the characters of the `Text` and write them into the `Vector`, in the `ST` monad – Benjamin Hodgson Dec 29 '16 at 00:27
  • Complicated? This program is equivalent to `T.foldr (flip V.snoc) V.empty` which contains only four identifiers - if you call this a complicated program, please give an example of a 'simple' one. Computationally inefficient? Have you profiled this program and determined it is too slow for your needs? Otherwise, do not worry about prematurely optimizing your program. The designers of these libraries put great effort into making such programs efficient. (As an aside, your program reverses the string - `T.foldr V.cons V.empty` gets you the original order) – user2407038 Dec 29 '16 at 01:15
  • `V.unfoldr T.uncons` is the simplest, I think. The result should fuse with suitable Vector functions, though the text will be constructed. – Michael Dec 29 '16 at 01:53
    `V.fromList . T.unpack`? maybe it doesn't make that big a difference after fusion – hao Dec 29 '16 at 02:41

1 Answer


Better is kind of subjective. What makes something better? Less code? Less memory consumption? Less time needed?

That being said, yes, your algorithm is computationally inefficient: `snoc` is O(n), where n is the length of the (intermediate) vector. You have therefore created an O(n²) algorithm (n `snoc` operations on vectors of increasing length, hence 1 + 2 + 3 + ... + (n-1) = n*(n-1)/2 element copies in total).
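To make the quadratic cost concrete, here's a back-of-the-envelope sketch (`snocCopies` is a helper name I made up, not a library function): each `snoc` copies the whole existing vector, so building a length-n vector by repeated `snoc` copies n*(n-1)/2 elements in total.

```haskell
-- Sketch: total elements copied when building a length-n vector
-- one `snoc` at a time (0 + 1 + ... + (n-1) = n*(n-1)/2).
snocCopies :: Int -> Int
snocCopies n = n * (n - 1) `div` 2

-- snocCopies 1000 == 499500, i.e. roughly half a million
-- element copies just to build a 1000-element vector.
```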

Depending on the constants hidden away in the big-O notation, one would expect your function to be slow on large inputs. But before we benchmark it, let's think about the alternatives:

  • we can go via an intermediate data structure that both types can convert to and from, e.g. `[Char]`, using `T.unpack` and `V.fromList`
  • we can unfold the vector with `V.unfoldr`, splitting off single `Char`s from the `T.Text` with `T.uncons`.
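As a sketch, the two alternatives look like this (the names `viaList` and `viaUnfoldr` are mine):

```haskell
import qualified Data.Text   as T
import qualified Data.Vector as V

-- Via an intermediate list: Text -> [Char] -> Vector Char
viaList :: T.Text -> V.Vector Char
viaList = V.fromList . T.unpack

-- Via unfolding: T.uncons splits off one Char at a time
viaUnfoldr :: T.Text -> V.Vector Char
viaUnfoldr = V.unfoldr T.uncons
```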

So let's use Criterion to create some benchmarks:

module Main where

import Criterion
import Criterion.Main
import Control.Monad (replicateM)
import qualified Data.Text as T
import qualified Data.Vector as V
import System.Random (randomRIO)

setupEnv :: Int -> IO T.Text
setupEnv n = fmap T.pack $ replicateM n (randomRIO (' ','~'))

textToVec :: T.Text -> V.Vector Char
textToVec t = T.foldr append V.empty t where
   append c vec = V.snoc vec c

example :: T.Text
example = T.pack "Hello, World"

benchOn :: T.Text -> [Benchmark]
benchOn ~e = map (\(s,f) -> bench s $ whnf f e)
     [("textToVec", textToVec)
     ,("T.foldr (flip V.snoc) V.empty", T.foldr (flip V.snoc) V.empty)
     ,("V.fromList . T.unpack", V.fromList . T.unpack)
     ,("V.unfoldr T.uncons", V.unfoldr T.uncons)
     ]

main = defaultMain [
    bgroup "example" $ benchOn example
  , env (setupEnv $  1000) $ \ ~small -> bgroup  "1000" $ benchOn small 
  , env (setupEnv $ 10000) $ \ ~small -> bgroup "10000" $ benchOn small 
  ]

I've used lts-5 to compile it:

stack exec --package random\
           --package vector\
           --package text\
           --resolver=lts-5\
      -- ghc -O2 Benchmark.hs

Benchmarking results

We first run all variants on "Hello, world", then on a randomly generated string of length 1000, and then one of length 10000.

"Hello, world"

benchmarking example/textToVec
time                 1.106 us   (1.098 us .. 1.117 us)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 1.102 us   (1.099 us .. 1.107 us)
std dev              12.67 ns   (6.277 ns .. 21.00 ns)

benchmarking example/T.foldr (flip V.snoc) V.empty
time                 1.225 us   (1.222 us .. 1.229 us)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 1.229 us   (1.226 us .. 1.232 us)
std dev              10.48 ns   (7.634 ns .. 16.00 ns)

benchmarking example/V.fromList . T.unpack
time                 643.6 ns   (640.9 ns .. 646.4 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 643.6 ns   (641.7 ns .. 645.7 ns)
std dev              6.525 ns   (5.573 ns .. 7.860 ns)

benchmarking example/V.unfoldr T.uncons
time                 518.1 ns   (516.1 ns .. 520.4 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 518.0 ns   (516.1 ns .. 520.1 ns)
std dev              6.541 ns   (4.943 ns .. 8.479 ns)
variance introduced by outliers: 11% (moderately inflated)

As you can see, at this point, unfolding or going via a list takes about half the time of your function: both the unfolding approach and `fromList . unpack` finish in about half a microsecond.

Random text with 1000 characters

benchmarking 1000/textToVec
time                 1.311 ms   (1.302 ms .. 1.320 ms)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 1.342 ms   (1.331 ms .. 1.366 ms)
std dev              55.31 us   (27.75 us .. 96.80 us)
variance introduced by outliers: 29% (moderately inflated)

benchmarking 1000/T.foldr (flip V.snoc) V.empty
time                 6.013 ms   (5.953 ms .. 6.081 ms)
                     0.999 R²   (0.997 R² .. 1.000 R²)
mean                 6.054 ms   (6.028 ms .. 6.097 ms)
std dev              100.8 us   (62.45 us .. 180.7 us)

benchmarking 1000/V.fromList . T.unpack
time                 26.83 us   (26.77 us .. 26.92 us)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 26.90 us   (26.83 us .. 27.00 us)
std dev              264.1 ns   (209.7 ns .. 348.2 ns)

benchmarking 1000/V.unfoldr T.uncons
time                 15.05 us   (14.93 us .. 15.19 us)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 15.02 us   (14.95 us .. 15.15 us)
std dev              303.0 ns   (206.6 ns .. 428.3 ns)
variance introduced by outliers: 19% (moderately inflated)

This is where it gets interesting. Your function is better than the point-free one; one would have to look into GHC's Core to see where the bottleneck is.

That being said, for a string about 100 times longer than our original one (length "Hello, World" == 12), our two contestants `V.unfoldr T.uncons` and `V.fromList . T.unpack` take only about 30 and 45 times longer, respectively.

Your function, on the other hand, needs 1 ms compared to the previous 1 µs. For 100 times the data, 1000 times the time. Uh oh…

Random text with 10000 characters

benchmarking 10000/textToVec
time                 190.9 ms   (188.7 ms .. 192.8 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 192.5 ms   (191.8 ms .. 193.6 ms)
std dev              1.096 ms   (684.4 us .. 1.426 ms)
variance introduced by outliers: 14% (moderately inflated)

benchmarking 10000/T.foldr (flip V.snoc) V.empty
time                 859.3 ms   (856.5 ms .. 860.9 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 861.9 ms   (860.7 ms .. 862.6 ms)
std dev              1.042 ms   (0.0 s .. 1.173 ms)
variance introduced by outliers: 19% (moderately inflated)

benchmarking 10000/V.fromList . T.unpack
time                 567.8 us   (565.5 us .. 570.6 us)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 571.7 us   (569.8 us .. 574.0 us)
std dev              7.132 us   (5.674 us .. 9.545 us)

benchmarking 10000/V.unfoldr T.uncons
time                 200.6 us   (199.4 us .. 201.8 us)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 200.2 us   (199.4 us .. 201.2 us)
std dev              2.948 us   (2.240 us .. 3.837 us)

We increased the data size by a factor of 10. Your function takes about 200 times as long as in the previous run, whereas both the unfoldr and the fromList variants take only about 20 times as long.

The winner

According to the benchmarks, V.unfoldr T.uncons takes the cake. So how would you, as a Haskell newbie, find that function?

Well, it's not that obvious at first. The "obvious" approach is usually to go via a list, since lists are everywhere in Haskell, and to be honest, `V.fromList . T.unpack` works quite well. That trick also works for every data structure that provides a `fromList` (as target) or a `toList` (as source); in particular, it works for any `Foldable` instance (`Text` isn't one).
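For example, the same `fromList . toList` shape converts a `Data.Sequence.Seq` into a `Vector` (`seqToVec` is a name I made up for illustration):

```haskell
import qualified Data.Foldable as F
import qualified Data.Sequence as Seq
import qualified Data.Vector   as V

-- Works for any Foldable source: toList it, then fromList it.
seqToVec :: Seq.Seq a -> V.Vector a
seqToVec = V.fromList . F.toList
```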

You could have found that if you looked into the documentation and saw the type of both V.unfoldr and T.uncons at the same time:

V.unfoldr :: (b      -> Maybe (a   , b     )) -> b      -> V.Vector a
T.uncons  ::  T.Text -> Maybe (Char, T.Text)
V.unfoldr T.uncons ::                            T.Text -> V.Vector Char

But now you know the unfoldr/uncons pattern, so you can use it in the future.
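The same pattern works with other unfolds, too. For instance, `Data.List.unfoldr` with `T.uncons` gives you the list version (equivalent to `T.unpack`; `textToList` is my name for it):

```haskell
import qualified Data.List as L
import qualified Data.Text as T

-- unfoldr/uncons again, this time producing a plain list.
textToList :: T.Text -> [Char]
textToList = L.unfoldr T.uncons
```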

Zeta
  • Nice experiments. (Isn't `~e` equivalent to `e`, though? Is there any reason for the lazy pattern?) – chi Dec 29 '16 at 13:05
  • having explicit values for [empirical orders of growth](https://en.wikipedia.org/wiki/Analysis_of_algorithms#Empirical_orders_of_growth) or even a log-log plot of run times vs sizes would be extremely nice :) (as seen e.g. [here](http://stackoverflow.com/a/30116605/849891) and [here](http://stackoverflow.com/a/30181351/849891)). – Will Ness Dec 29 '16 at 14:46
  • @chi: It's recommended by the Criterion documentation – Zeta Dec 29 '16 at 16:32
  • @WillNess: as soon as I'm not stuck on a mobile Core2DuoT6500 (and an almost broken Win8.1 development machine). So... early next week? – Zeta Dec 29 '16 at 16:33
  • Here's the obvious conversion via the underlying Stream types: http://sprunge.us/JZFf. It's a little faster, but not by much, it seems. – Michael Dec 29 '16 at 17:04
  • It might be worth noting that since `Char` has an `Unbox` instance you can import `Data.Vector.Unboxed` instead of `Data.Vector`, which here is about 4x faster, as usual. – Michael Dec 29 '16 at 17:19
  • @Michael: If and only if OP actually wants to use `Data.Vector.Unboxed.Vector` instead of `Data.Vector.Vector`. But yeah, I should probably add that too. – Zeta Dec 29 '16 at 18:09
  • Wow, great analysis. I learned a lot from the answer and all the comments. I had actually looked at unfoldr when I was trying to get this to work, but couldn't easily figure out a function to pass into it. Is unfoldr/uncons a well known pattern? It seems like they were made for each other, but I don't see any hint of that in the documentation. In any case, both options seem much nicer than mine, so thanks! – happydave Dec 29 '16 at 18:19
  • @Zeta don't mind me, I'm just preaching this technique every chance I get. :) – Will Ness Dec 29 '16 at 20:23