
I'm compiling a Haskell executable that, on startup, reads about 50MB of data from the file system that has been serialized using the serialise package and then applies some transformations to it before continuing.

I'd like to improve the startup speed of the executable, and in theory I could use Template Haskell to deserialize the files at compile time and write them out as data constructors. But would this actually improve performance? If the bulk of the time goes to calling the data constructors (i.e., if the file I/O and deserialization are fast), it wouldn't be worth it; whereas if calling the data constructors is fast, it might be.
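Concretely, the approach I have in mind is something like this sketch (`Foo`, the `Types` module, and `data.cbor` are stand-ins for my real types and files; the splice has to live in a separate module from `Foo` because of the TH stage restriction):

```haskell
{-# LANGUAGE TemplateHaskell #-}
module Embedded where

import qualified Data.ByteString.Lazy as BL
import Codec.Serialise (deserialise)
import Language.Haskell.TH.Syntax (lift, runIO)

-- Foo derives Lift (so TH can turn values back into syntax)
-- and has a Serialise instance; both live in another module.
import Types (Foo)

-- Read and deserialize the file while compiling, then splice the
-- resulting value in as one giant literal expression.
embedded :: [Foo]
embedded = $(runIO (deserialise <$> BL.readFile "data.cbor" :: IO [Foo]) >>= lift)
```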

Also, does GHC have any notion of compile-time evaluation for large data structures? I.e., if I have something of type [Foo] that is known at compile time and contains ~50MB of data, is there any way the executable can contain it precompiled in whatever the Haskell equivalent of static data is, or will it be built lazily at runtime like everything else?

Thanks in advance for your help & advice!

Mike
  • Are you considering TH because you want to deserialize the data at compile time rather than runtime, or for some other reason? – amalloy Feb 20 '22 at 12:50
  • Yes, exactly, it's to deserialize the data at compile time. If there's a better way to do this then I'm all for it, but I've only ever used TH for things like that. – Mike Feb 20 '22 at 12:58
  • If you want to try this be sure to read this stack overflow answer. It will help you avoid some bad performance issues. https://stackoverflow.com/questions/12716215/load-pure-global-variable-from-file/12717160#12717160 – David Fox Feb 20 '22 at 15:42

1 Answer


I'm pessimistic. You seem unlikely to save time on file I/O: if you deserialize 50MB worth of stuff at compile time, you have to bake that into the executable, and it will probably get about 50MB larger, assuming that the serialization format and GHC's format are both reasonably efficient encodings. Thus, loading the executable into memory will get slower, by about the amount of time you were previously spending on reading the data file.

Likewise, GHC will have to deserialize whatever format it uses to bake the data into the executable. A program could avoid this if the in-memory data structure were identical to the on-disk representation, but I can't imagine that being the case, since the normal in-memory representation is rife with pointers. Here again, it seems likely that GHC's internal format is not much cheaper to deserialize than CBOR, so any costs you avoid by not reading the file, you will incur by making the executable slower to prepare.
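Before committing to TH, it's worth measuring where the time actually goes. A rough sketch of splitting I/O from decoding (here `Foo` stands in for the real type and is assumed to have an `NFData` instance; timings are wall-clock and only indicative):

```haskell
import qualified Data.ByteString as BS
import qualified Data.ByteString.Lazy as BL
import Codec.Serialise (deserialise)
import Control.DeepSeq (force)
import Control.Exception (evaluate)
import Data.Time.Clock (getCurrentTime, diffUTCTime)

import Types (Foo)  -- placeholder for the real type

timed :: String -> IO a -> IO a
timed label act = do
  t0 <- getCurrentTime
  r  <- act
  t1 <- getCurrentTime
  putStrLn (label ++ ": " ++ show (diffUTCTime t1 t0))
  pure r

main :: IO ()
main = do
  -- File I/O alone (a strict read, so the cost isn't deferred):
  bytes <- timed "read" (BS.readFile "data.cbor")
  -- Decoding plus allocating every constructor (forced via NFData
  -- so laziness doesn't hide the work):
  _ <- timed "decode+construct"
         (evaluate (force (deserialise (BL.fromStrict bytes) :: [Foo])))
  pure ()
```

If "read" dominates, baking the data into the executable just moves that cost into binary loading; if "decode+construct" dominates, a cheaper on-disk representation is the more promising lever.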

amalloy
  • It _should_ be possible to store a structure, pointers and all, by way of [compact regions](https://hackage.haskell.org/package/ghc-compact). This would probably be much less space efficient than the serialization format though, and the overhead of loading more data may very well outweigh the serialization. – leftaroundabout Feb 20 '22 at 13:57
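A minimal sketch of the compact-regions idea from the last comment, using `GHC.Compact` from the ghc-compact package (persisting a region to disk would additionally need something like the `compact` package; the list here is just a stand-in for real data):

```haskell
import GHC.Compact (compact, getCompact, compactSize)

main :: IO ()
main = do
  let xs = [1 .. 1000000 :: Int]
  -- Copy the structure, pointers and all, into one contiguous
  -- region; the copy also deep-forces the value.
  region <- compact xs
  size <- compactSize region
  putStrLn ("compact region size: " ++ show size ++ " bytes")
  -- The value is used via getCompact, exactly as before:
  print (length (getCompact region))
```

Note the size trade-off the comment warns about: a compact region stores every heap object verbatim, so it is typically much larger than a CBOR encoding of the same data.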