How do I parse a large data block into memory in Haskell?

Question

On reflection, this entire question can be boiled down to something much more concise. I'm looking for a Haskell data structure that

looks like a list
has O(1) lookup
has either O(1) element replacement or O(1) element append (or prepend... I could reverse my index lookups if that were the case). I can always write my later algorithms with one or the other in mind.
has very little memory overhead

I'm trying to build an image file parser. The file format is your basic 8-bit color ppm file, though I intend to support 16-bit color files and PNG and JPEG files. The existing Netpbm library, despite a lot of unboxing annotations, actually consumes all available memory when trying to load the files that I work with:

3-10 photographs, the smallest being 45MB and the largest being 110MB.

Now, I can't understand the optimizations put into the Netpbm code, so I decided to have my own try at it. It's a simple file format...

I have started by deciding that no matter what the file format, I'm going to store the final image uncompressed in this format:

import Data.Vector.Unboxed (Vector)
data PixelMap = RGB8 {
      width :: Int
    , height :: Int
    , redChannel :: Vector Word8
    , greenChannel :: Vector Word8
    , blueChannel :: Vector Word8
    }

Then I wrote a parser that works on three vectors like so:

import Data.Attoparsec.ByteString
data Progress = Progress {
      addr      :: Int
    , size      :: Int
    , redC      :: Vector Word8
    , greenC    :: Vector Word8
    , blueC     :: Vector Word8
    }

parseColorBinary :: Progress -> Parser Progress
parseColorBinary progress@Progress{..}
    | addr == size = return progress
    | addr < size = do
        !redV <- anyWord8
        !greenV <- anyWord8
        !blueV <- anyWord8
        parseColorBinary progress { addr    = addr + 1
                                  , redC    = redC V.// [(addr, redV)]
                                  , greenC  = greenC V.// [(addr, greenV)]
                                  , blueC   = blueC V.// [(addr, blueV)] }

And at the end of the parser, I construct the RGB8 like so:

Progress{..} <- parseColorBinary $ ...
return $ RGB8 width height redC greenC blueC

Written like this, the program, loading a single one of these 45MB images, will consume 5GB or more of memory. If I change the definition of Progress so that redC, greenC, and blueC are all !(Vector Word8), then the program remains within reasonable memory confines, but takes so long to load a single file that I haven't allowed it to actually finish. Finally, if I replace the vectors here with standard lists, my memory usage shoots back up to 5GB per file (I assume... I actually run out of space before I hit that), and load time is on the order of 6 seconds. Ubuntu's preview application, once started, loads and renders the file nearly instantly.

On the theory that each call to V.// is actually fully copying the vector every single time, I tried switching to Data.Vector.Unboxed.Mutable, but... I can't even get that to typecheck. The documentation is nonexistent and understanding what the data types are doing is going to require fighting with multiple other libraries as well. And I don't even know if it will solve the problems, so I'm very reluctant to even try.

The fundamental problem is actually pretty straightforward:

How do I quickly, and without using an obscene amount of memory, read, retain, and potentially manipulate a very large data structure? All of the examples I have found are about generating temporarily huge data and then getting rid of it as soon as possible.

In principal, I want the final representation to be immutable, but I don't too much care if I have to use a mutable structure to get there.

Just for completeness, the complete code (BSD3-licensed) is on bitbucket in https://bitbucket.org/savannidgerinel/photo-tools . The performance branch contains a strict version of the parser, which can be made unstrict with a quick change in the Progress data structure of Codec.Image.Netpbm.

To run the performance test

ulimit -Sv 6000000 -- set a ulimit of 6GB, or change to whatever makes sense for you
cabal build
dist/build/perf-test/perf-test +RTS -p -sstderr

Have you considered using [mmap](http://hackage.haskell.org/package/mmap)? If the operations you want to perform on the images are not too IO intensive, it may be worth using it. — Danny Navarro, Mar 20 '14 at 07:49
Another option is to use some library specialized in large arrays like [`repa`](http://hackage.haskell.org/package/repa) or even [`accelerate`](http://hackage.haskell.org/package/accelerate). They are written with high performance in mind, so they should come with many optimizations for memory efficiency as well. — Danny Navarro, Mar 20 '14 at 07:56
Your bitbucket package depends of another package of yours named "alyra-common", but the current version doesn't match (> 0.2.1 required, but we have 0.2.0). Could you maybe update your bitbucket stuff? — András Kovács, Mar 20 '14 at 12:47
If you want help getting your mutable vector stuff to typecheck, show us the code. Bonus points for minimizing it to the smallest chunk of code that is complete (I can just stick it in a file and run it through GHC) and still wrong. — Daniel Wagner, Mar 20 '14 at 12:58
@AndrásKovács Sorry. Problem fixed. I changed the alyra-common repository to git and pushed the 0.2.1 tag. — Savanni D'Gerinel, Mar 20 '14 at 14:24
@Mau 4767x3195, 8-bit. That's a resolution from my old camera, and a bitlevel from me being sloppy about the parameters I gave to ufraw. During artistic work, I'll be needing this application to handle multiple files that are 6024x4024, 16-bit. — Savanni D'Gerinel, Mar 20 '14 at 14:26
@DannyNavarro I hadn't considered that. I could try `mmap` for loading an image, but after the images are loaded I'll be using them to generate a new image, so I may get myself trapped back in the world of too much memory allocation. I'll look into `repa` and `accelerate`. — Savanni D'Gerinel, Mar 20 '14 at 14:29
@Mau no, it's not that big. 4767 * 3195 * 3 is simply 45MB. I'm pretty sure that the extra memory is, in one instance, the overhead of modifying a pure data structure, so all of the instances of the constructor over and over again. The memory copying GC behaviour is incredible. — Savanni D'Gerinel, Mar 20 '14 at 14:33

András Kovács · Accepted Answer · 2014-03-20T15:16:59.817

I first thought that just simply reading the whole chunk of bytestring and then unzipping the contents into unboxed vectors would be good enough. Indeed, the parsing code you posted would be rather bad even without the mysterious space leak: you copy the entirety of all three vectors on every single byte of the input! Talk about quadratic complexity.

So I wrote the following:

chunksOf3 :: [a] -> [(a, a, a)]
chunksOf3 (a:b:c:xs) = (a, b, c) : chunksOf3 xs
chunksOf3 _          = []

parseRGB :: Int -> Atto.Parser (Vector Word8, Vector Word8, Vector Word8)
parseRGB size = do
    input <- Atto.take (size * 3)
    let (rs, gs, bs) = unzip3 $ chunksOf3 $ B.unpack input
    return (V.fromList rs, V.fromList gs, V.fromList bs)

And then tested it with a 45 Mb file of random bytes. I admit I was surprised that this code resulted in gigabytes of RAM usage. I'm curious as to where exactly the space leaks.

Mutable vectors work well though. The following code uses 133 Mb RAM and Criterion benchmarks it to 60 ms file reading included. I included some explanations in the comments. There is also ample material on the ST monad and mutable vectors on SO and elsewhere (I concur though that the library documentations are unfriendly to beginners).

import Data.Vector.Unboxed (Vector)
import Data.ByteString (ByteString)

import qualified Data.Vector.Unboxed as V
import qualified Data.ByteString as B
import qualified Data.Vector.Unboxed.Mutable as MV

import Control.Monad.ST.Strict 
import Data.Word
import Control.Monad
import Control.DeepSeq

-- benchmarking stuff
import Criterion.Main (defaultMainWith, bench, whnfIO)
import Criterion.Config (defaultConfig, Config(..), ljust)

-- This is just the part that parses the three vectors for the colors.
-- Of course, you can embed this into an Attoparsec computation by taking 
-- the current input, feeding it to parseRGB, or you can just take the right 
-- sized chunk in the parser and omit the "Maybe" test from the code below. 
parseRGB :: Int -> ByteString -> Maybe (Vector Word8, Vector Word8, Vector Word8)
parseRGB size input 
    | 3* size > B.length input = Nothing
    | otherwise = Just $ runST $ do

        -- We are allocating three mutable vectors of size "size"
        -- This is usually a bit of pain for new users, because we have to
        -- specify the correct type somewhere, and it's not an exactly simple type.
        -- In the ST monad there is always an "s" type parameter that labels the
        -- state of the action. A type of "ST s something" is a bit similar to
        -- "IO something", except that the inner type often also contains "s" as
        -- parameter. The purpose of that "s" is to statically disallow mutable
        -- variables from escaping the ST action. 
        [r, g, b] <- replicateM 3 $ MV.new size :: ST s [MV.MVector s Word8]

        -- forM_ = flip mapM_
        -- In ST code forM_ is a nicer looking approximation of the usual
        -- imperative loop. 
        forM_ [0..size - 1] $ \i -> do
            let i' = 3 * i
            MV.unsafeWrite r i (B.index input $ i'    )
            MV.unsafeWrite g i (B.index input $ i' + 1)
            MV.unsafeWrite b i (B.index input $ i' + 2)

        -- freeze converts a mutable vector living in the ST monad into 
        -- a regular vector, which can be then returned from the action
        -- since its type no longer depends on that pesky "s".
        -- unsafeFreeze does the conversion in place without copying.
        -- This implies that the original mutable vector should not be used after
        -- unsafeFreezing. 
        [r, g, b] <- mapM V.unsafeFreeze [r, g, b]
        return (r, g, b)

-- I prepared a file with 3 * 15 million random bytes.
inputSize = 15000000
benchConf = defaultConfig {cfgSamples = ljust 10}

main = do
    defaultMainWith benchConf (return ()) $ [
        bench "parseRGB test" $ whnfIO $ do 
            input <- B.readFile "randomInp.dat" 
            force (parseRGB inputSize input) `seq` putStrLn "done"
        ]

I started with a Vector because I assumed that structural sharing and all would prevent the copy-on-update operation. Technically, though, that wouldn't make any sense. All the memory overhead I get with a normal list is probably necessary to allow structural sharing. — Savanni D'Gerinel, Mar 20 '14 at 15:07
It doesn't seem to matter how many libraries I learn to use. I *always* feel like a Haskell newb every time I step into any new library. — Savanni D'Gerinel, Mar 20 '14 at 15:07
Finally, why did you use the `unsafe` operations instead of the normal ones? — Savanni D'Gerinel, Mar 20 '14 at 15:08
@SavanniD'Gerinel it's just because I already established the correctness of the bounds with the initialization, so I might as well skip the bounds check. In most programming languages I'd consider the widespread use of unsafe indexing pretty bad, but in Haskell ST code is pretty rare, and when it's used it's when we really care about speed, so I might as well go all-out in those cases :) — András Kovács, Mar 20 '14 at 15:14
@SavanniD'Gerinel, note though that the `unsafeFreeze` is a bit more justified since we really do nothing with the vector aside from returning them at that point. — András Kovács, Mar 20 '14 at 15:15
Oh, wow this is fast. I only see 73MB maximum in use, but the performance is just shocking. I'll clean up my code and push a new update so that you can see it. — Savanni D'Gerinel, Mar 20 '14 at 16:18
@SavanniD'Gerinel also, it occurs to me that you could use `unsafeNew` instead of `new` to make it even faster. — András Kovács, Mar 20 '14 at 17:09

Gabriella Gonzalez · Answer 2 · 2014-03-21T04:11:04.887

Here's a version which parses the file straight from the disk without loading any intermediate into memory:

import Control.Applicative
import Control.Monad (void)
import Data.Attoparsec.ByteString (anyWord8)
import Data.Attoparsec.ByteString.Char8 (decimal)
import qualified Data.Attoparsec.ByteString as Attoparsec
import Data.ByteString (ByteString)
import Data.Vector.Unboxed (Vector)
import Data.Word (Word8)
import Control.Foldl (FoldM(..), impurely, vector, premapM) -- Uses `foldl-1.0.3`
import qualified Pipes.ByteString
import Pipes.Parse
import Pipes.Attoparsec (parse, parsed)
import qualified System.IO as IO

data PixelMap = PixelMap {
      width :: Int
    , height :: Int
    , redChannel :: Vector Word8
    , greenChannel :: Vector Word8
    , blueChannel :: Vector Word8
    } deriving (Show)

-- Fold three vectors simultaneously, ensuring strictness and efficiency
rgbVectors
    :: FoldM IO (Word8, Word8, Word8) (Vector Word8, Vector Word8, Vector Word8)
rgbVectors =
    (,,) <$> premapM _1 vector <*> premapM _2 vector <*> premapM _3 vector
  where
    _1 (a, b, c) = a
    _2 (a, b, c) = b
    _3 (a, b, c) = c

triples
    :: Monad m
    => Producer ByteString m r
    -> Producer (Word8, Word8, Word8) m ()
triples p = void $ parsed ((,,) <$> anyWord8 <*> anyWord8 <*> anyWord8) p

-- I will probably ask Renzo to simplify the error handling for `parse`
-- This is a helper function to just return `Nothing`
parse'
    :: Monad m
    => Attoparsec.Parser r -> Parser ByteString m (Maybe r)
parse' parser = do
    x <- parse parser
    return $ case x of
        Just (Right r) -> Just r
        _              -> Nothing

parsePixelMap :: Producer ByteString IO r -> IO (Maybe PixelMap)
parsePixelMap p = do
    let parseWH = do
            mw <- parse' decimal
            mh <- parse' decimal
            return ((,) <$> mw <*> mh)
    (x, p') <- runStateT parseWH p
    case x of
        Nothing     -> return Nothing
        Just (w, h) -> do
            let size = w * h
                parser = impurely foldAllM rgbVectors
                source = triples (p' >-> Pipes.ByteString.take size)
            (rs, gs, bs) <- evalStateT parser source
            return $ Just (PixelMap w h rs gs bs)

main = IO.withFile "image.ppm" IO.ReadMode $ \handle -> do
    pixelMap <- parsePixelMap (Pipes.ByteString.fromHandle handle)
    print pixelMap

I tested it without the header logic on a 50 MB file and it runs in roughly the same amount of space.

How do I parse a large data block into memory in Haskell?

2 Answers2

Linked