
I've been using Data.Binary to serialize data to files. In my application I incrementally add items to these files. The two most popular serialization packages, binary and cereal, both serialize lists as a count followed by the list items. Because of this, I can't append to my serialized files. I currently read in the whole file, deserialize the list, append to the list, re-serialize the list, and write it back out to the file. However, my data set is getting large and I'm starting to run out of memory. I could probably go around unboxing my data structures to gain some space, but that approach doesn't scale.
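
For concreteness, the whole-file workflow looks roughly like the sketch below (appendItem is a hypothetical helper name; it uses Data.Binary's encodeFile/decodeFile):

import Data.Binary (Binary, decodeFile, encodeFile)

-- Sketch of the current approach: deserialize everything, append one
-- item, re-serialize. Memory use grows with the size of the data set.
appendItem :: Binary a => FilePath -> a -> IO ()
appendItem path x = do
    xs <- decodeFile path
    -- force the list before overwriting the file we just read from
    length xs `seq` encodeFile path (xs ++ [x])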

One solution would be to get down and dirty with the file format to change the initial count, then just append my elements. But that's not very satisfying, not to mention being sensitive to future changes in the file format as a result of breaking the abstraction. Iteratees/Enumerators come to mind as an attractive option here. I looked for a library combining them with a binary serialization, but didn't find anything. Anyone know if this has been done already? If not, would a library for this be useful? Or am I missing something?

Don Stewart
mightybyte
  • Could you write a streaming instance for Binary? It's relatively easy to write a chunk-wise (eager) encoder that writes sets of *n* elements at a time. – Don Stewart Jun 01 '11 at 17:38
  • Oh, I don't disagree that it's fairly straightforward. I mainly want to know if it's already been done. Also, if not, are there existing abstractions or type classes that I should start from? – mightybyte Jun 01 '11 at 17:46
  • There's none that I know of. Though you might have luck looking for seekable file APIs on Hackage. – Don Stewart Jun 01 '11 at 17:57
  • You appear to be considering writing a library to provide the functionality you want, assuming none exist already. If you do that, and upload it to Hackage, would you return here afterward and link to it from this question? Either as an answer or an edit to the question. Stack Overflow has crazy-go-nuts google juice, so this question is likely to be found in the future by anyone searching for something similar. – C. A. McCann Jun 01 '11 at 18:08
  • The current Data.Binary format for lists is a rather egregious wart requiring two traversals of the structure. Unfortunately, it can't readily be changed for backwards compatibility reasons. You can, however, do as Don suggests below and build a newtype wrapper and use it carefully everywhere you need to serialize a list. – Edward Kmett Jun 01 '11 at 19:08
  • I had this problem when writing wave files with sndfile-enumerators. For that, I used a StateT to keep track of the amount of data written, and then seek to the count tag in the file and update it. I honestly don't know a solution to this problem that isn't ugly in one way or another. – John L Jun 01 '11 at 21:15
  • @Edward Kmett I think a new format could be made backwards compatible. If the length word is -1 then it's the new format, otherwise a regular length. Not beautiful, but backwards compatibility when somebody forgot to think about it from the start rarely is. – augustss Jun 01 '11 at 21:36
  • @augustss Hrmm. That's not bad. -1 and 0 signaling cons and nil, or even -n signaling n elements known and then prompting again for the remainder after that. – Edward Kmett Jun 02 '11 at 06:44
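
A minimal sketch of the backwards-compatible scheme discussed in the last two comments, assuming a hypothetical CompatList wrapper: a sentinel length of -1 selects a tagged, appendable encoding, while any non-negative length is read as the existing count-prefixed format.

import Control.Monad (replicateM)
import Data.Binary (Binary(..), Get, getWord8, putWord8)

newtype CompatList a = CompatList [a]

instance Binary a => Binary (CompatList a) where
    put (CompatList xs) = do
        put (-1 :: Int)                       -- sentinel: tagged format follows
        mapM_ (\x -> putWord8 1 >> put x) xs  -- 1 tags a cons cell
        putWord8 0                            -- 0 tags the end of the list
    get = do
        n <- get :: Get Int
        if n >= 0
            then CompatList <$> replicateM n get  -- old count-prefixed format
            else CompatList <$> go                -- new tagged format
      where
        go = do
            t <- getWord8
            case t of
                0 -> return []
                _ -> (:) <$> get <*> go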

2 Answers


So I say stick with Data.Binary but write a new instance for growable lists. Here's the current (strict) instance:

instance Binary a => Binary [a] where
    put l  = put (length l) >> mapM_ put l
    get    = do n <- get :: Get Int
                getMany n

-- | 'getMany n' gets 'n' elements in order, without blowing the stack.
getMany :: Binary a => Int -> Get [a]
getMany n = go [] n
 where
    go xs 0 = return $! reverse xs
    go xs i = do x <- get
                 x `seq` go (x:xs) (i-1)
{-# INLINE getMany #-}
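
To see why this format can't be appended to in place, note that the element count is written up front. For example (assuming a 64-bit platform, where binary writes an Int as eight big-endian bytes):

ghci> import Data.Binary
ghci> import Data.Word
ghci> import qualified Data.ByteString.Lazy as L
ghci> L.unpack (encode ([1,2,3] :: [Word8]))
[0,0,0,0,0,0,0,3,1,2,3]

The first eight bytes are the length prefix, so adding elements later means rewriting that prefix as well as appending the new items.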

Now, a version that lets you stream (in binary) in order to append to a file would need to be either eager or lazy. The lazy version is the simplest. Something like:

import Data.Binary

newtype Stream a = Stream { unstream :: [a] }

instance Binary a => Binary (Stream a) where

    put (Stream [])     = putWord8 0                              -- end-of-stream tag
    put (Stream (x:xs)) = putWord8 1 >> put x >> put (Stream xs)  -- cons tag, then element

    get = do
        t <- getWord8
        case t of
            0 -> return (Stream [])
            1 -> do x         <- get
                    Stream xs <- get
                    return (Stream (x:xs))
            _ -> fail "Stream: unexpected tag byte"

Massaged appropriately, this works for streaming. Now, to handle silently appending, we'll need to be able to seek to the end of the file and overwrite the final 0 tag before adding more elements.
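
For example, here is a minimal sketch of such an append, assuming the file was written with the Stream instance above and that its last byte is the terminating 0 tag (appendStream is a hypothetical helper name):

import Data.Binary (Binary(..))
import Data.Binary.Put (putWord8, runPut)
import qualified Data.ByteString.Lazy as L
import System.IO

appendStream :: Binary a => FilePath -> [a] -> IO ()
appendStream path xs =
    withBinaryFile path ReadWriteMode $ \h -> do
        hSeek h SeekFromEnd (-1)                  -- position on the final 0 tag
        L.hPut h $ runPut $ do
            mapM_ (\x -> putWord8 1 >> put x) xs  -- overwrite it with new cons cells
            putWord8 0                            -- re-terminate the stream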

Don Stewart
  • hmm, this seems to work properly for encoding, but the decoding doesn't seem to be actually streaming (it reads the whole input in for me, when I try to just use the first entry) – gatoatigrado Feb 04 '14 at 22:04

It's been four years since this question was answered, but I ran into the same problem as gatoatigrado reported in the comment on Don Stewart's answer: the put method works as advertised, but get reads the whole input. I believe the problem lies in the pattern match in the case statement, Stream xs <- get, which must determine whether the remaining input is a Stream a before it can return.

My solution used the example in Data.Binary.Get as a starting point:

import Data.ByteString.Lazy(toChunks,ByteString)
import Data.Binary(Binary(..),getWord8)
import Data.Binary.Get(Get,pushChunk,Decoder(..),runGetIncremental)
import Data.List(unfoldr)

decodes :: Binary a => ByteString -> [a]
decodes = runGets (getWord8 >> get)

runGets :: Get a -> ByteString -> [a]
runGets g = unfoldr (decode1 d) . toChunks
  where d = runGetIncremental g

decode1 _ [] = Nothing                              -- out of input: stop unfolding
decode1 d (x:xs) = case d `pushChunk` x of
                     Fail _ _ str  -> error str
                     Done x' _ a   -> Just (a,x':xs)  -- keep leftover bytes for the next element
                     k@(Partial _) -> decode1 k xs    -- decoder needs more input

Note the use of getWord8. It consumes the tag byte (the encoded [] and : markers) produced by the Stream instance's put. Also note that because getWord8 simply discards those tags, this implementation will not detect the end of the list; my encoded file contained just a single list, so that was fine, but otherwise you'll need to modify it. One possible modification is sketched below.
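
For instance, a variant that does stop at the terminating 0 tag could be added to the module above (getTagged and decodes' are hypothetical names, reusing runGets from the code above):

import Data.Maybe (catMaybes, isJust)

-- Read one tag byte: 0 marks the end of the list, anything else
-- precedes an element.
getTagged :: Binary a => Get (Maybe a)
getTagged = do
    t <- getWord8
    case t of
        0 -> return Nothing
        _ -> Just <$> get

decodes' :: Binary a => ByteString -> [a]
decodes' = catMaybes . takeWhile isJust . runGets getTagged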

In any case, this decodes ran in constant memory whether I accessed the head or the last element of the list.
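
A possible usage sketch, assuming the Stream newtype from the accepted answer is in scope alongside decodes above, and a hypothetical file name items.bin:

import Data.Binary (encode)
import qualified Data.ByteString.Lazy as L

main :: IO ()
main = do
    -- write a large list using the tagged streaming encoding
    L.writeFile "items.bin" (encode (Stream [1 .. 1000000 :: Int]))
    -- read it back lazily; only the demanded prefix is decoded
    bytes <- L.readFile "items.bin"
    print (head (decodes bytes :: [Int]))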

trevor cook