7

I am lazily encoding lists using this code (taken from this SO question):

import Data.Binary

newtype Stream a = Stream { unstream :: [a] }

instance Binary a => Binary (Stream a) where

    put (Stream [])     = putWord8 0
    put (Stream (x:xs)) = putWord8 1 >> put x >> put (Stream xs)

The problem is that the decoding implementation is not lazy:

    get = do
        t <- getWord8
        case t of
            0 -> return (Stream [])
            1 -> do x         <- get
                    Stream xs <- get
                    return (Stream (x:xs))

This looks to me like it should be lazy, but if we run this test code:

head $ unstream (decode $ encode $ Stream [1..10000000::Integer] :: Stream Integer)

memory usage explodes. For some reason it wants to decode the whole list before letting me look at the first element.

Why is this not lazy, and how I can make it lazy?

Community
  • 1
  • 1
Mike Izbicki
  • 6,286
  • 1
  • 23
  • 53

1 Answers1

7

It is not lazy because the Get monad is a strict state monad (in binary-0.5.0.2 to 0.5.1.1; it was a lazy state monad before, and in binary-0.6.* it has become a continuation monad, I haven't analysed the strictness implications of that change):

-- | The parse state
data S = S {-# UNPACK #-} !B.ByteString  -- current chunk
           L.ByteString                  -- the rest of the input
           {-# UNPACK #-} !Int64         -- bytes read

-- | The Get monad is just a State monad carrying around the input ByteString
-- We treat it as a strict state monad. 
newtype Get a = Get { unGet :: S -> (# a, S #) }

-- Definition directly from Control.Monad.State.Strict
instance Monad Get where
    return a  = Get $ \s -> (# a, s #)
    {-# INLINE return #-}

    m >>= k   = Get $ \s -> case unGet m s of
                             (# a, s' #) -> unGet (k a) s'
    {-# INLINE (>>=) #-}

thus the final recursive

get >>= \x ->
get >>= \(Stream xs) ->
return (Stream (x:xs))

forces the entire Stream to be read before it can be returned.

I don't think it's possible to lazily decode a Stream in the Get monad (so a fortiori not with the Binary instance). But you can write a lazy decoding function using runGetState:

-- | Run the Get monad applies a 'get'-based parser on the input
-- ByteString. Additional to the result of get it returns the number of
-- consumed bytes and the rest of the input.
runGetState :: Get a -> L.ByteString -> Int64 -> (a, L.ByteString, Int64)
runGetState m str off =
    case unGet m (mkState str off) of
      (# a, ~(S s ss newOff) #) -> (a, s `join` ss, newOff)

First write a Get parser that returns a Maybe a,

getMaybe :: Binary a => Get (Maybe a)
getMaybe = do
    t <- getWord8
    case t of
      0 -> return Nothing
      _ -> fmap Just get

then use that to make a function of type (ByteString,Int64) -> Maybe (a,(ByteString,Int64)):

step :: Binary a => (ByteString,Int64) -> Maybe (a,(ByteString,Int64))
step (xs,offset) = case runGetState getMaybe xs offset of
                     (Just v, ys, newOffset) -> Just (v,(ys,newOffset))
                     _                       -> Nothing

and then you can use Data.List.unfoldr to lazily decode a list,

lazyDecodeList :: Binary a => ByteString -> [a]
lazyDecodeList xs = unfoldr step (xs,0)

and wrap that in a Stream

lazyDecodeStream :: Binary a => ByteString -> Stream a
lazyDecodeStream = Stream . lazyDecodeList
Daniel Fischer
  • 181,706
  • 17
  • 308
  • 431
  • Is there a design resign why the Put monad is lazy, but the Get monad is strict? This seems very awkward. – Mike Izbicki Jul 27 '12 at 23:41
  • 2
    `Get` was lazy in `binary < 0.5`. I'm not sure about the details, but iirc, the reason to switch to a strict implementation (even using unboxed tuples) was performance. With a lazy state monad, it is terribly easy to build up huge thunks, and people ran into that problem - enormous memory consumption and slow performance - not too seldomly. Most of the time, a strict state monad is better or at least not worse than a lazy one, so overall the gain seems larger than the loss. It is probably less work to write a lazy decoder with the current implementation than to avoid leaks with the old. – Daniel Fischer Jul 27 '12 at 23:59
  • I thought the whole point of having the cereal package and binary was that if you wanted strictness you use cereal. Is there another difference? (And thanks for all the help!) – Mike Izbicki Jul 28 '12 at 04:05
  • 4
    It was changed for performance reasons. When binary was added to GHC, we found that it was 2 to 3x faster when using a strict state monad. – Don Stewart Jul 28 '12 at 13:43
  • Are there any plans to get a lazy version like this into `binary`, e.g. `Data.Binary.Lazy`? With `unGet` and the likes not being exported, obviously we cannot use this as a workaround. – nh2 Aug 21 '12 at 11:47
  • 1
    @nh2 I don't think so. You can't mix lazy and strict decoding, so it would have to be a separate implementation. Maybe a `Data.Binary[.Put/Get].Lazy` would be included if a feature request gets enough support. Otherwise, one could make one's own package providing lazy de- and serialization. – Daniel Fischer Aug 26 '12 at 10:57