
Let's assume that I have a big file (500GB+) and I have a data record declaration Sample which indicates a row in that file:

data Sample = Sample {
  field1 :: Int,
  field2 :: Int
}

Now, what data structure is suitable for processing (filter/map/fold) a collection of these Sample values? Don Stewart has answered here that the Sample type should not be treated as a list [Sample] but as a Vector. My question is: how does representing it as a Vector solve the problem? Won't representing the file contents as a vector of Sample values also occupy around 500 GB?

What is the recommended method for solving this type of problem?

  • It's not really clear what your problem is. How to structure data? or how to locate a data record? Is it not fast enough? takes too much space? Also, can't you just ask Don to elaborate on his answer? – Scott Solmer Jun 20 '14 at 12:21
  • 1
    I think the comments from 2009 remain. Don't store it as a list in memory, process incrementally using laziness and/or mmap. – Don Stewart Jun 20 '14 at 12:24
  • @DonStewart Thanks, but how does treating them as `Vector` solve this problem? If I have to do two or more `map` operations on the entire Vector of `Sample` data, won't it exhaust the memory? – Sibi Jun 20 '14 at 12:49
  • 3
    You can't hold all the data in memory no matter what you do. Vector simply increases the amount you can process in each chunk. You must still use laziness or explicit laziness via mmap to process incrementally. – Don Stewart Jun 20 '14 at 12:50
  • @DonStewart Thanks, can you put up that as an answer. – Sibi Jun 20 '14 at 12:55

1 Answer


As far as I can see, the operations you want to use (filter, map and fold) can be done via both conduit (see Data.Conduit.List) and pipes (see Pipes.Prelude).

Both libraries are perfectly capable of manipulating/folding and filtering streaming data. Depending on your scenario they might solve your actual problem.
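Both libraries build on the same idea: the file is never fully resident, and elements flow through the pipeline one at a time. Here is a minimal sketch of that constant-memory pattern using only lazy lists from the standard library (the parser `parseSample`, the field layout, and the stand-in input are all assumptions; conduit or pipes would replace the plain list pipeline in a real program):

```haskell
import Data.List (foldl')

data Sample = Sample { field1 :: Int, field2 :: Int }

-- Hypothetical line parser; a real one would match the file's format.
parseSample :: String -> Sample
parseSample s = let [a, b] = map read (words s) in Sample a b

-- filter/map/fold fused over a lazily produced stream: only one
-- Sample is alive at a time, so memory use stays constant.
sumField1 :: [String] -> Int
sumField1 =
  foldl' (+) 0 . map field1 . filter ((> 0) . field2) . map parseSample

main :: IO ()
main = do
  -- stand-in for the lazy lines of a huge file
  let ls = [show i ++ " " ++ show (i `mod` 2) | i <- [1 .. 1000 :: Int]]
  print (sumField1 ls)
```

The same shape carries over directly: with conduit the composition would run through `Data.Conduit.List`, with pipes through `Pipes.Prelude`, but in each case the fold consumes elements as they are produced instead of after the whole file is loaded.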

If, however, you need to investigate values several times, you're better off loading chunks into a vector, as @Don said.
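A sketch of that chunked approach (using plain lists in place of `Data.Vector` to keep the example dependency-free; in a real program each chunk would be an unboxed vector, e.g. built with `Data.Vector.Unboxed.fromList`, and the chunk size is an arbitrary choice):

```haskell
import Data.List (foldl')

-- Split a huge stream into fixed-size chunks: each chunk is loaded,
-- scanned several times, then dropped before the next chunk is read.
chunksOf :: Int -> [a] -> [[a]]
chunksOf _ [] = []
chunksOf n xs = let (c, rest) = splitAt n xs in c : chunksOf n rest

-- Two passes over each chunk (minimum and maximum), with the partial
-- results combined across chunks, so no chunk has to be kept around.
minMax :: [Int] -> (Int, Int)
minMax xs = foldl' step (maxBound, minBound) (chunksOf 4096 xs)
  where
    step (lo, hi) c = (min lo (minimum c), max hi (maximum c))

main :: IO ()
main = print (minMax [3, 1, 4, 1, 5, 9, 2, 6])
```

Only one chunk is in memory at a time, yet within that chunk you can make as many passes as you like, which is exactly what a strict vector per chunk buys you.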

  • Thanks. Is `Data.Conduit.List` reasonable for this? As soon as I build up the list to pass to `sourceList`, my memory will explode. – Sibi Jun 20 '14 at 13:05
  • 1
    You cannot build up the source list—you have to generate the data incrementally and feed it directly into the conduit. – J. Abrahamson Jun 20 '14 at 13:32
  • @Sibi You need to use `sourceFile` from `Data.Conduit.Binary`. – Zeta Jun 21 '14 at 06:42