On reflection, this entire question can be boiled down to something much more concise. I'm looking for a Haskell data structure that
- looks like a list
- has O(1) lookup
- has either O(1) element replacement or O(1) element append (or prepend, in which case I could reverse my index lookups); I can always write my later algorithms with one or the other in mind
- has very little memory overhead
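To illustrate why the obvious candidate falls short (complexities are as I understand them from the vector documentation; the helper names are mine, purely for illustration):

import Data.Word (Word8)
import qualified Data.Vector.Unboxed as V

-- Data.Vector.Unboxed gives me the lookup I want...
lookupPx :: V.Vector Word8 -> Int -> Word8
lookupPx v i = v V.! i              -- O(1)

-- ...but both ways of building it one element at a time copy:
replacePx :: V.Vector Word8 -> Int -> Word8 -> V.Vector Word8
replacePx v i p = v V.// [(i, p)]   -- O(n): copies the whole vector

appendPx :: V.Vector Word8 -> Word8 -> V.Vector Word8
appendPx v p = V.snoc v p           -- O(n): copies on every append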
I'm trying to build an image file parser. The file format is your basic 8-bit color PPM file, though I intend to support 16-bit color files as well as PNG and JPEG. The existing Netpbm library, despite a lot of unboxing annotations, consumes all available memory when trying to load the files I work with: 3-10 photographs, the smallest being 45MB and the largest 110MB.
Now, I can't understand the optimizations put into the Netpbm code, so I decided to try my own hand at it. It's a simple file format...
I have started by deciding that no matter what the file format, I'm going to store the final image uncompressed in this format:
import Data.Word (Word8)
import Data.Vector.Unboxed (Vector)

-- The final, uncompressed representation: one unboxed vector per channel.
data PixelMap = RGB8
    { width        :: Int
    , height       :: Int
    , redChannel   :: Vector Word8
    , greenChannel :: Vector Word8
    , blueChannel  :: Vector Word8
    }
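Pixel access on this representation is then plain O(1) indexing (pixelAt is a hypothetical helper, just to show the intended access pattern):

import qualified Data.Vector.Unboxed as V

-- Row-major O(1) lookup across the three channel vectors.
pixelAt :: PixelMap -> Int -> Int -> (Word8, Word8, Word8)
pixelAt pm x y = ( redChannel pm V.! i
                 , greenChannel pm V.! i
                 , blueChannel pm V.! i )
  where i = y * width pm + x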
Then I wrote a parser that works on three vectors like so:
{-# LANGUAGE BangPatterns #-}
{-# LANGUAGE RecordWildCards #-}

import Data.Attoparsec.ByteString
import Data.Word (Word8)
import Data.Vector.Unboxed (Vector)
import qualified Data.Vector.Unboxed as V

data Progress = Progress
    { addr   :: Int
    , size   :: Int
    , redC   :: Vector Word8
    , greenC :: Vector Word8
    , blueC  :: Vector Word8
    }

-- One pixel per recursive call; each channel vector is updated with V.//.
parseColorBinary :: Progress -> Parser Progress
parseColorBinary progress@Progress{..}
    | addr == size = return progress
    | addr < size = do
        !redV <- anyWord8
        !greenV <- anyWord8
        !blueV <- anyWord8
        parseColorBinary progress { addr   = addr + 1
                                  , redC   = redC   V.// [(addr, redV)]
                                  , greenC = greenC V.// [(addr, greenV)]
                                  , blueC  = blueC  V.// [(addr, blueV)] }
And at the end of the parser, I construct the RGB8 like so:
Progress{..} <- parseColorBinary $ ...
return $ RGB8 width height redC greenC blueC
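For what it's worth, the way I drive the parser looks roughly like this (paraphrased; loadPixels and initial are stand-ins, not the exact code from the repository):

import qualified Data.ByteString as B
import qualified Data.Vector.Unboxed as V
import Data.Attoparsec.ByteString (parseOnly)

-- Allocate three zero-filled channel vectors up front, then run the
-- attoparsec parser over the raw pixel data.
loadPixels :: Int -> Int -> B.ByteString -> Either String Progress
loadPixels w h bytes = parseOnly (parseColorBinary initial) bytes
  where
    n = w * h
    initial = Progress { addr = 0, size = n
                       , redC   = V.replicate n 0
                       , greenC = V.replicate n 0
                       , blueC  = V.replicate n 0 }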
Written like this, the program, loading a single one of these 45MB images, consumes 5GB or more of memory. If I change the definition of Progress so that redC, greenC, and blueC are all strict (!(Vector Word8)), then the program stays within reasonable memory bounds, but takes so long to load a single file that I haven't allowed it to actually finish. Finally, if I replace the vectors here with standard lists, my memory usage shoots back up to 5GB per file (I assume... I actually run out of space before I hit that), and load time is on the order of 6 seconds. Ubuntu's preview application, once started, loads and renders the file nearly instantly.
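Concretely, the strict variant of the record is just:

data Progress = Progress
    { addr   :: Int
    , size   :: Int
    , redC   :: !(Vector Word8)
    , greenC :: !(Vector Word8)
    , blueC  :: !(Vector Word8)
    }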
On the theory that each call to V.// is actually fully copying the vector every single time, I tried switching to Data.Vector.Unboxed.Mutable, but I can't even get that to typecheck. The documentation is nonexistent, and understanding what the data types are doing will require fighting with multiple other libraries as well. And I don't even know whether it will solve the problem, so I'm very reluctant to even try.
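In isolation, I think the mutable API is supposed to be used something like this (a minimal sketch, assuming I'm reading the types right; fillPixels is a made-up stand-in, and I still don't see how to thread this through an attoparsec Parser):

import Control.Monad (forM_)
import Control.Monad.ST (runST)
import Data.Word (Word8)
import qualified Data.Vector.Unboxed as V
import qualified Data.Vector.Unboxed.Mutable as M

-- Allocate one mutable vector, write each element in place (O(1) per
-- write), then freeze it into an immutable Vector without copying.
fillPixels :: Int -> [Word8] -> V.Vector Word8
fillPixels n pixels = runST $ do
    mv <- M.new n
    forM_ (zip [0 .. n - 1] pixels) $ \(i, p) ->
        M.write mv i p
    V.unsafeFreeze mv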
The fundamental problem is actually pretty straightforward:
How do I quickly, and without using an obscene amount of memory, read, retain, and potentially manipulate a very large data structure? All of the examples I have found are about generating temporarily huge data and then getting rid of it as soon as possible.
In principle, I want the final representation to be immutable, but I don't much care if I have to use a mutable structure to get there.
Just for completeness, the complete code (BSD3-licensed) is on bitbucket at https://bitbucket.org/savannidgerinel/photo-tools. The performance branch contains a strict version of the parser, which can be made non-strict with a quick change to the Progress data structure in Codec.Image.Netpbm.
To run the performance test:
ulimit -Sv 6000000   # set a ulimit of 6GB, or change to whatever makes sense for you
cabal build
dist/build/perf-test/perf-test +RTS -p -sstderr