5

A simplified version of the question

I have a huge matrix-like dataset, that we for now can pretend is actually an n-by-n matrix stored on-disk as n^2 IEEE-754 doubles (see details below the line on how this is a simplification - it probably matters). The file is on the order of a gigabyte, but in a certain (pure) function I will only need on the order of n of the elements contained in it. Exactly which elements will be needed is complicated, and not something like a simple slice.

What are my options for decoupling reading the file from disk and the computation? Most of all, I'd like to treat the on-disk data as if it were in memory (I am of course ready to swear to all the gods of referential transparency that the data on disk will not change). I've looked at mmap and friends, but some cursory testing shows that these seem not to aggressively enough free memory.

Do I have to go couple my computations to IO if I need such fine-grained control of how much of the file is kept in memory?


A more honest description of the on-disk data

The data on disk isn't actually as simple as described. Something closer to the truth would be the following: A file begins with a 32 bit integer n. The following then occurs precisely n times: A 32 bit integer m_i > 0 (1 ≤ i ≤ n), followed by exactly m_i IEEE-754 doubles x_(i,1),…,x_(i, m_i). (So, this is a jagged two-dimensional array).

In practice, determining i and j for which x_(i, j) is needed depends highly on the m_i's. When approaching the problem with mmap, the need to read so many of these m_is seems to essentially load the entire file into memory. The problem is that it all seems to stay there, and I worry that I will have to pull my computation into IO to have more fine-grained control over the releasing of this memory.

Moreover, "the data structure" actually consists of a large number of these files parameterized by their file names. Together they amount to about a gigabyte.


An attempt at a more handwaving, but possibly easier to understand version of the question

Say I have some data on disk consisting of n^2 elements. A pure Haskell function needs on the order of n of the elements, but which of them depends in a complicated way on the values. I do not want to load the entire file into memory, because it is huge. One solution is to throw my function into the IO monad and read out elements as they are needed, but I call this "giving up". mmap lets us treat on-disk data as if it were in memory, essentially doing lazy IO with help from the OS' virtual memory system. This is nice, but since determining which elements of the data are needed requires accessing a lot of the file, mmap seems to keep way too much of the file in memory. In practice, I find that reading the data I need to determine the data I actually need loads the entire file into memory when using mmap.

What options do I have?

gspr
  • 11,144
  • 3
  • 41
  • 74
  • Upon reading my own question, I now realize it is extremely poorly written. I would have trouble understanding it myself. I will try to improve upon it if I can find a succinct way of describing my problem without going into the nitty gritty details of the on-disk data. – gspr Jun 17 '14 at 18:38
  • 2
    It would certainly be helpful to seperate the `m_i` in some way, e.g. first load all these into a strict array manually, thus scanning the entry points for each array row. Then later just use those offsets and ignore the integers in the file. – leftaroundabout Jun 17 '14 at 18:41
  • @leftaroundabout: Yeah that's a possible compromise. It won't couple my computation to IO, but instead it would demand an IO preprocessing step. I guess I can live with that, unless something better comes along. Thanks. – gspr Jun 17 '14 at 18:43
  • I wonder if I could patch mmap (the Hackage package, I mean) to do madvise(2) with `MADV_DONTNEED` for a more aggressive freeing behavior… thoughts? – gspr Jun 17 '14 at 19:50
  • @gspr - Is your main concern staying out of a monad? or is it making sure resources are released? – Davorak Jun 17 '14 at 20:57
  • @Davorak: Making sure I don't keep more ("more" meaning n^2 instead of n) things in memory than I need, while not banishing my nice pure computation to `IO`. Let's put it like this: If it weren't for the fact that too much data seems to linger in memory, mmap-vector would be pretty much perfect for the job. – gspr Jun 17 '14 at 21:04
  • @Davorak: I guess the answer to your questions is thus "yes". :-) – gspr Jun 17 '14 at 21:08
  • @gspr - It still is not clear if you want to stay out of say the stateT monad, or a pipes monad, etc which are not necessarily in IO. That said if you want prompt resource finalization something somewhere will need to be in IO, not necessarily you computation however. – Davorak Jun 17 '14 at 22:00
  • "but some cursory testing shows that these seem not to aggressively enough free memory." Is the problem a) that this memory is slowing down the computation, or b) that you wish the memory would be freed faster to return resources to the system? – Thomas Jun 23 '14 at 06:49
  • @Davorak: Indeed, I will probably have to use pipes or something like that. Thanks. – gspr Jun 23 '14 at 21:46
  • @Thomas: b, for sure. – gspr Jun 23 '14 at 21:47
  • 1
    Then perhaps the simplest solution is to call System.Mem.performGC immediately after you've done your computation. – Thomas Jun 23 '14 at 23:11

1 Answers1

1

I would suggest that you write an interface that is entirely in IO, where you have an abstract type that contains both a Handle and information about the overall structure of your data (perhaps all the m_is if you can fit them), and this is complemented by IO operations that read out precise bits of the data by seeking in the handle.

I would then simply wrap this interface in a bunch of unsafePerformIO calls! This is effectively what mmap does behind the scenes, in a sense. You just are doing so in a more explicitly managed way.

Assuming you aren't worried about anyway "swapping out" the file behind your back, you can get an interface that you can reason about purely while it actually does IO where necessary to give the explicit control over memory you need.

sclv
  • 38,665
  • 7
  • 99
  • 204