
I have a CSV file with stock trading history; it is 70 megabytes. I want to run my program on it, but I don't want to wait 30 seconds on every start. I see two options:

1. Just translate the CSV file into a Haskell source file, like this:

From                       | To
-------------------------------------------
1380567537,122.166,2.30243 | history = [
...                        |       (1380567537,122.166,2.30243)
...                        |     , ...
...                        |     ]

2. Use Template Haskell to parse the file at compile time (a rough sketch follows).
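Roughly what I mean by the second option; this is only a sketch, and `parseRow` would have to live in a separate, already-compiled module because of GHC's stage restriction (the module and file names here are illustrative):

```haskell
{-# LANGUAGE TemplateHaskell #-}
module History (history) where

import Language.Haskell.TH.Syntax (lift, runIO)
-- Hypothetical helper module; TH's stage restriction forbids using
-- a parser defined in this same module.
import ParseRow (parseRow)

history :: [(Int, Double, Double)]
history = $(do
  -- Read and parse the CSV while GHC is compiling this module...
  rows <- runIO (fmap (map parseRow . lines) (readFile "history.csv"))
  -- ...then lift the parsed rows into one (huge) Haskell expression.
  lift rows)
```

Note that `lift` still turns the parsed rows into one enormous expression, so presumably GHC would face much the same work as with the generated source file.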

Trying the first approach, I found that GHC ate up 12 GB of memory after 3 hours of trying to compile the one list (a 70 MB source file).

So is TH the only available approach? Or can I just use a hard-coded large data structure in a source file? And why can't GHC compile the file? Does it run into a combinatorial explosion because of complex optimizations, or something like that?

  • Using fast libraries like bytestring and attoparsec will reduce the time to much less than 30 seconds. – Satvik Oct 01 '13 at 06:14
  • Possible duplicate of http://stackoverflow.com/a/6403503/83805 – Don Stewart Oct 01 '13 at 07:15
  • Have you tried [cassava](http://blog.johantibell.com/2012/08/a-new-fast-and-easy-to-use-csv-library.html)? – jtobin Oct 01 '13 at 20:35
  • Don, yes, it is related, but the answer to that question was about inserting bytestring literals into the code and then converting them into a structure; I wanted the already-compiled structure in my program. – Kirill Taran Oct 03 '13 at 11:55
  • jtobin, the question is not about that, but I will try it. Thank you anyway. – Kirill Taran Oct 03 '13 at 11:56
  • Here is my benchmark: http://jsbin.com/ucIbIgu (CSV stands for Data.CSV from the MissingH package). Serialization is really blazing fast. – Kirill Taran Oct 05 '13 at 08:30

1 Answer


Hard-coding so much data is not a common use case, so it isn't surprising that the compiler doesn't handle it well.

A better solution would be to put the data into some format that is easier to read than CSV. For example, consider writing a program that parses your CSV file once and serializes the resulting structure with a package like cereal. Then your main program can read the binary file, which should be much faster than parsing the CSV file.
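A minimal sketch of that setup, assuming rows of (timestamp, price, volume); the `Row` type, the file names, and the naive comma splitting here are my assumptions, not part of the question:

```haskell
module Main where

import qualified Data.ByteString as BS
import Data.Serialize (encode, decode)

type Row = (Int, Double, Double)

-- Naive field splitting; a real converter might use cassava or attoparsec.
parseRow :: String -> Row
parseRow s = case splitOn ',' s of
  [t, p, v] -> (read t, read p, read v)
  _         -> error ("bad row: " ++ s)
  where
    splitOn c xs = case break (== c) xs of
      (chunk, [])       -> [chunk]
      (chunk, _ : rest) -> chunk : splitOn c rest

-- One-off converter: parse the CSV once, write a compact binary file.
convert :: IO ()
convert = do
  contents <- readFile "history.csv"
  BS.writeFile "history.bin" (encode (map parseRow (lines contents)))

-- In the main program: decode the binary file, which should be much
-- faster than re-parsing the CSV on every start.
loadHistory :: IO [Row]
loadHistory = do
  bytes <- BS.readFile "history.bin"
  either fail return (decode bytes)

main :: IO ()
main = do
  convert
  rows <- loadHistory
  print (length rows)
```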

This approach has the added benefit that running your program on new data will be easier and won't require recompiling.

Tikhon Jelvis
  • I assume that data built into the program would give better performance. Is that true, or would the boost be insignificant? Anyway, nice tip; I had forgotten about this possibility. – Kirill Taran Oct 01 '13 at 07:10
  • I really doubt the performance difference is significant. However, I'm not entirely certain; I'd have to run a benchmark or something. More importantly, though, I think it is *very* likely that the cereal approach is fast *enough* for your purposes, and it sounds much easier to implement. That's what I'd try first. – Tikhon Jelvis Oct 01 '13 at 07:19
  • Here is my benchmark: http://jsbin.com/ucIbIgu (CSV stands for Data.CSV from the MissingH package). Serialization is really blazing fast. – Kirill Taran Oct 05 '13 at 08:32