
I am trying to parse a binary file into a Haskell vector. I can load the file into a regular list, but since each file holds more than 10000000 elements, performance is terrible.

To parse the binary file, I use Data.Binary.Get and Data.Binary.IEEE754, since I intend to read float values. I am trying to build the vector as a mutable one and then return it frozen.

I end up stuck because Get is not an instance of Control.Monad.Primitive.PrimMonad, which looks pretty obscure to me.

import qualified Data.ByteString.Lazy        as B
import qualified Data.Vector.Unboxed.Mutable as UM
import qualified Data.Vector.Unboxed         as U
import Data.Binary.Get
import Data.Binary.IEEE754

type MyVectorOfFloats = U.Vector Float

main = do
    -- Lazily read the content of the file as a ByteString
    file_content <- B.readFile "vec.bin"
    -- Parse the bytestring and get the vector
    vec <- runGet (readWithGet 10) file_content :: MyVectorOfFloats
    -- Do something useful with it...
    return ()


readWithGet :: Int
            -> Get MyVectorOfFloats -- ^ Operates in the Get monad
readWithGet n = do
    -- Initialize a mutable vector of the desired size
    vec <- UM.new n
    -- Initialize the vector with values obtained from the Get monad
    fill vec 0
    -- Finally return a frozen version of the vector
    U.unsafeFreeze vec
  where
    fill v i
        | i < n = do
            -- Hopefully read one float32 from the Get monad
            f <- getFloat32le
            -- place the value inside the vector
            -- In the real situation, I would do more complex decoding with
            -- my float value f
            UM.unsafeWrite v i f
            -- and go to the next value to read
            fill v (i + 1)
        | otherwise = return ()

The example above is quite simple; in my real situation I have run-length-like decoding to do, but the problem stays the same.

First, do the libraries I selected seem adequate for my use? I do not currently need the whole vector in memory at once; I can operate on chunks. Something from pipes or Conduit looks interesting.

Do I have to make Get an instance of Control.Monad.Primitive.PrimMonad to do what I want?

I think I could try to do some unfolding pattern to build the vector without mutable state.

Lancelot SIX
  • There are exactly 2 `PrimMonad`s: `IO` and `ST`. Any other monad will do though, if you can write it as a transformer over one of these... [there's quite an interesting discussion going on right now](http://stackoverflow.com/questions/24515876/is-there-a-monad-that-doesnt-have-a-corresponding-monad-transformer-except-io) about whether this should be possible for basically _all_ monads, including `Get`... but at any rate it doesn't have a working instance, so, you can't do it right away. — The best performance for such a huge load of binary data is certainly to fiddle with low-level `Storable`. – leftaroundabout Jul 02 '14 at 12:23
  • Another note is that `Get` always loads data in its entirety. Do you really want all 10000000 numbers in memory at once? If you could stream the data in chunks, you'd be a lot happier with the memory use. – Carl Jul 02 '14 at 14:40

1 Answer


If you don't need all the data at once, you should probably be using a streaming library. (If things are very simple, you might get by with lazy I/O.)
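As a rough sketch of the "lazy I/O" route, assuming the file is nothing but a sequence of little-endian 32-bit floats and that binary 0.8.4 or later is available (so `getFloatle` can stand in for `getFloat32le` from the question): the lazy ByteString is cut into fixed-size pieces and each piece is decoded into its own unboxed vector. The names `floatChunks` and `encodeFloats` are hypothetical, and `U.replicateM` is used for brevity rather than speed.

```haskell
import Data.Binary.Get (getFloatle, runGet)
import Data.Binary.Put (putFloatle, runPut)
import qualified Data.ByteString.Lazy as B
import qualified Data.Vector.Unboxed as U

-- | Hypothetical helper: split the input into vectors of at most chunkSize
-- little-endian floats. The result list is produced lazily, one chunk at a
-- time. Trailing bytes that do not form a whole float are dropped.
floatChunks :: Int -> B.ByteString -> [U.Vector Float]
floatChunks chunkSize input
    | count <= 0 = []
    | otherwise  = runGet (U.replicateM count getFloatle) prefix
                       : floatChunks chunkSize rest
  where
    -- Take the bytes for one chunk (4 bytes per float) off the front.
    (prefix, rest) = B.splitAt (4 * fromIntegral chunkSize) input
    -- Number of whole floats actually present in this chunk.
    count = fromIntegral (B.length prefix) `div` 4 :: Int

-- | Hypothetical helper for trying the sketch without a file on disk.
encodeFloats :: [Float] -> B.ByteString
encodeFloats = runPut . mapM_ putFloatle
```

With a real file this could be driven by something like `B.readFile "vec.bin" >>= mapM_ (print . U.sum) . floatChunks 65536`, so that only the chunk being processed needs to be decoded at any moment. A streaming library gives more control over resources and errors than this does.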

Your error comes from the fact that you've declared that the 'do' block operates in the Get monad, but UM.new can only operate in the ST or IO monad. You'll want to change readWithGet to be in the IO or ST monad, although it might still use the Get monad "under the hood" (and call runGet internally).
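A minimal sketch of that suggestion, assuming binary 0.8.4 or later so that `getFloatle` can stand in for `getFloat32le` from the question: the mutable vector lives in ST, and Get is only run on the input one value at a time. The names `readVector` and `encodeFloats` are hypothetical, and calling `runGetOrFail` once per element is chosen for clarity, not for speed.

```haskell
import Control.Monad.ST (runST)
import Data.Binary.Get (getFloatle, runGetOrFail)
import Data.Binary.Put (putFloatle, runPut)
import qualified Data.ByteString.Lazy as B
import qualified Data.Vector.Unboxed as U
import qualified Data.Vector.Unboxed.Mutable as UM

-- | Hypothetical helper: decode exactly n little-endian floats from the input.
-- The vector is allocated and filled in ST; Get only decodes one value per step.
readVector :: Int -> B.ByteString -> Either String (U.Vector Float)
readVector n input = runST $ do
    vec <- UM.new n
    let fill i rest
            -- All n slots are written, so freezing without a copy is safe here.
            | i >= n = Right <$> U.unsafeFreeze vec
            | otherwise =
                case runGetOrFail getFloatle rest of
                    Left (_, _, err) ->
                        return (Left ("element " ++ show i ++ ": " ++ err))
                    Right (rest', _, f) -> do
                        -- Any extra decoding of f would go here.
                        UM.write vec i f
                        fill (i + 1) rest'
    fill 0 input

-- | Hypothetical helper for trying the sketch without a file on disk.
encodeFloats :: [Float] -> B.ByteString
encodeFloats = runPut . mapM_ putFloatle
```

For example, `readVector 3 (encodeFloats [1.5, 2.5, -3])` yields a `Right` holding the three values, while asking for more elements than the input contains yields a `Left`.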

Here's what I came up with for converting between Get and Parser (from pipes-parse):

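{-# LANGUAGE RankNTypes #-}
import Control.Monad.Trans.Class  (lift)
import Control.Monad.Trans.Either (EitherT, left)  -- from the either package
import Data.Binary.Get            (ByteOffset, Decoder (..), Get, runGetIncremental)
import Data.ByteString            (ByteString)
import Pipes.Parse                (Parser, draw, unDraw)

type GetParser m a = Parser ByteString (EitherT GetFail m) a

-- Offset and message reported by a failed Get
data GetFail = GetFail !ByteOffset String

get2parser :: Monad m => Get a -> GetParser m a
get2parser g = decoder2parser (runGetIncremental g)

decoder2parser :: Monad m => Decoder a -> GetParser m a
-- On failure or success, push the unconsumed input back before finishing
decoder2parser (Fail r off err) = unDraw r >> lift (left $ GetFail off err)
-- The decoder wants more input: draw the next chunk (Nothing at end of input)
decoder2parser (Partial cont) = draw >>= \chunk -> decoder2parser (cont chunk)
decoder2parser (Done r _ a) = unDraw r >> return a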