
Can anybody suggest an alternative to importing a couple of gigabytes of numeric data (in .mx form) from a list of 60 .mx files, each about 650 MB?

The research problem (too large to post here) involved simple statistical operations on roughly twice as much data (around 34 GB) as available RAM (16 GB). To handle the data-size problem I simply split things up and used a Get / Clear strategy to do the math.
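
A minimal sketch of that strategy; the file names (`part1.mx` … `part60.mx`), the symbol `data` they define, and the per-chunk statistic are all hypothetical stand-ins:

```mathematica
(* sketch of a Get / Clear loop over pre-split MX files;
   names and the accumulated statistic are hypothetical *)
result = 0;
Do[
  Get["part" <> ToString[i] <> ".mx"];  (* loads the symbol `data` *)
  result += Total[data, Infinity];      (* any per-chunk statistic *)
  Clear[data],                          (* free the ~650 MB before the next chunk *)
  {i, 60}
]
```

The point is only that each chunk is fully released before the next `Get`, so peak memory stays near one chunk's size rather than the full 34 GB.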

It does work, but calling Get["bigfile.mx"] takes quite some time, so I was wondering if it would be quicker to use BLOBs or whatever with PostgreSQL or MySQL or whatever database people use for GB of numeric data.

So my question really is: What is the most efficient way to handle truly large data set imports in Mathematica?

I have not tried it yet, but I think that SQLImport from DatabaseLink will be slower than Get["bigfile.mx"].

Does anyone have some experience to share?

(Sorry if this is not a very specific programming question, but it would really help me to move on with that time-consuming finding-out-what-is-the-best-of-the-137-possibilities-to-tackle-a-problem-in-Mathematica).

Arnoud Buzing
Rolf Mertig
    [Does this help](http://stackoverflow.com/q/7525782/616736)? Another [related question](http://stackoverflow.com/q/8247005/616736). – abcd Dec 20 '11 at 22:38
  • @yoda Note that Rolf is using MX which is the native binary format of Mathematica, and in my experience faster than anything else when using `Import`/`ReadList`. I don't know about `BinaryReadList` ... – Szabolcs Dec 21 '11 at 08:29
  • @Rolf **+1**, very relevant question. It doesn't answer you, but you'll surely be interested [in this presentation](http://library.wolfram.com/infocenter/Conferences/8025/). It seems Mathematica 9 is bringing significant improvements in this area. – Szabolcs Dec 21 '11 at 08:30
  • @Rolf I just tested with a 370 MB mx file, and it imports in less than a second here. I did `rr = RandomReal[1, {100,100,100,50}]; DumpSave["rr.mx",rr]; Timing[Get["rr.mx"];]`. I wonder why our experiences differ. What kind of data are you reading? Is my version only fast because it only has a single large packed array? – Szabolcs Dec 21 '11 at 08:38
  • @Szabolcs Yes, I have a list of different length packed arrays, and that list itself cannot be packed. – Rolf Mertig Dec 21 '11 at 09:16
  • @Rolf Are you sure your sublists are packed? I still can't reproduce the problem here. I tried `Table[RandomReal[1, {RandomInteger[{100, 10000}]}], {10000}]` which makes an array of 10000 packed arrays. The mx file is ~400 MB and importing is still below 1 second. If I make this 3 times smaller but use ``Developer`FromPackedArray`` on it to unpack, import is still a reasonable 3 seconds only. Can you post an example data structure that reproduces the problem? – Szabolcs Dec 21 '11 at 09:39
  • @Rolf I think this would depend on the application. Some things can be done via techniques I exposed in the answers linked by yoda, or the other answers given here. The main question IMO seems to be - do you need to import any given file just once, or not. If not, you may benefit from memory mapping and / or databases, where you could use things like Hibernate which has sophisticated caching mechanisms. You could implement lazy Mathematica structures on top of that. In any case, I agree with Szabolcs that we'd need to know more specifics about the problem to give a more definite answer. – Leonid Shifrin Dec 21 '11 at 10:42
  • @RolfMertig, when you are using Get[] I assume you are reading from a local drive? Reading from a non-local drive (like a network drive) can cause slowness as well. In my environment there is an order of magnitude difference between these two scenarios. – Arnoud Buzing Dec 21 '11 at 17:43
  • @RolfMertig I'd be interested in what solution you went with at last as I'm facing a similar problem at the moment. – Szabolcs Jan 17 '12 at 17:14
  • I stayed with the .mx file-based approach, carefully Clear-ing big expressions in the Do loop. It was an old project which I happened to look at again, but did not want to spend too much time on. – Rolf Mertig Jan 17 '12 at 21:52

2 Answers


Here's an idea:

You said you have a ragged matrix, i.e. a list of lists of different lengths. I'm assuming floating point numbers.

You could flatten the matrix to get a single long packed 1D array (use Developer`ToPackedArray to pack it if necessary), and store the starting indexes of the sublists separately. Then reconstruct the ragged matrix after the data has been imported.
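
That round trip might look like this (a sketch; `ragged` stands for any list of packed 1D arrays, and the file name is arbitrary):

```mathematica
(* store: one flat packed array plus the sublist lengths *)
flat = Developer`ToPackedArray[Flatten[ragged]];
lens = Length /@ ragged;
DumpSave["flat.mx", {flat, lens}];

(* reconstruct after Get["flat.mx"]: cut the flat array
   at the positions implied by the recorded lengths *)
ends    = Accumulate[lens];
starts  = ends - lens + 1;
ragged2 = MapThread[flat[[#1 ;; #2]] &, {starts, ends}];
(* ragged2 should equal the original ragged list *)
```

Storing the lengths rather than the start indexes is equivalent; either way the reconstruction is pure in-kernel slicing of a packed array.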


Here's a demonstration that within Mathematica (i.e. after import), extracting the sublists from a big flattened list is fast.

data = RandomReal[1, 10000000];

indexes = Union@RandomInteger[{1, 10000000}, 10000];    
ranges = #1 ;; (#2 - 1) & @@@ Partition[indexes, 2, 1];

data[[#]] & /@ ranges; // Timing

{0.093, Null}

Alternatively, store a sequence of sublist lengths and use Mr.Wizard's dynamicPartition function, which does exactly this. My point is that storing the data in a flat format and partitioning it in-kernel is going to add negligible overhead.


Importing packed arrays as MX files is very fast. I only have 2 GB of memory, so I cannot test on very large files, but the import times are always a fraction of a second for packed arrays on my machine. This avoids the problem that importing unpacked data can be slower (although, as I said in the comments on the main question, I cannot reproduce the kind of extreme slowness you mention).


If BinaryReadList were fast (it isn't as fast as reading MX files now, but it looks like it will be significantly sped up in Mathematica 9), you could store the whole dataset as one big binary file, without needing to break it into separate MX files. Then you could import relevant parts of the file like this:

First make a test file:

In[3]:= f = OpenWrite["test.bin", BinaryFormat -> True]

In[4]:= BinaryWrite[f, RandomReal[1, 80000000], "Real64"]; // Timing
Out[4]= {9.547, Null}

In[5]:= Close[f]

Open it:

In[6]:= f = OpenRead["test.bin", BinaryFormat -> True]    

In[7]:= StreamPosition[f]

Out[7]= 0

Skip the first 5 million entries:

In[8]:= SetStreamPosition[f, 5000000*8]

Out[8]= 40000000

Read 5 million entries:

In[9]:= BinaryReadList[f, "Real64", 5000000] // Length // Timing    
Out[9]= {0.609, 5000000}

Read all the remaining entries:

In[10]:= BinaryReadList[f, "Real64"] // Length // Timing    
Out[10]= {7.782, 70000000}

In[11]:= Close[f]

(For comparison, Get usually reads the same data from an MX file in less than 1.5 seconds here. I am on WinXP btw.)


EDIT If you are willing to spend time on this, and write some C code, another idea is to create a library function (using Library Link) that will memory-map the file (link for Windows), and copy it directly into an MTensor object (an MTensor is just a packed Mathematica array, as seen from the C side of Library Link).

Szabolcs
  • Have you tried my `dynamicPartition` (or just the core `dynP`) function in the toolbag post? I believe it should be a little faster than what you proposed. If it is, will you include a link? – Mr.Wizard Dec 21 '11 at 17:08
  • @Mr.Wizard My point here was merely to show that storing the data in a flat format and partitioning it in-kernel is not going to add noticeable overhead (not to find the best way to partition). Link added, of course. – Szabolcs Dec 22 '11 at 10:44
  • @Mr.Wizard Why not, your function does exactly the same thing as what I was showing here! I just pointed out that it was not the main point of the answer (which is really just some ideas and not a complete answer). – Szabolcs Dec 22 '11 at 11:07

I think the two best approaches are either:

1) use Get on the *.mx file,

2) or read in that data and save it in some binary format for which you write LibraryLink code, and then read the data via that. That, of course, has the disadvantage that you'd need to convert your MX files. But perhaps this is an option.

Generally speaking Get with MX files is pretty fast.

Are you sure this is not a swapping problem?

Edit 1: You could then also write an import converter: tutorial/DevelopingAnImportConverter
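
A minimal sketch of such a converter, assuming a hypothetical format name ("RawReal64") and a hypothetical file layout (an "Integer64" count header followed by that many "Real64" values); check the registration details against the tutorial before relying on this:

```mathematica
(* hypothetical importer function: reads the count header, then the data *)
importRawReal64[filename_String, opts___] :=
 Module[{str = OpenRead[filename, BinaryFormat -> True], n, data},
  n    = BinaryRead[str, "Integer64"];      (* header: number of entries *)
  data = BinaryReadList[str, "Real64", n];
  Close[str];
  {"Data" -> data}]

(* register it under a custom format name *)
ImportExport`RegisterImport["RawReal64", importRawReal64]

(* usage *)
Import["test.bin", {"RawReal64", "Data"}]
```

The benefit is that the reading logic lives behind the standard Import interface, so callers don't need to know about the binary layout.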

  • It is not a swapping problem. The problem is that I have more data than fits in RAM, so I have to read in parts of the data sequentially, and this is done multiple times; so if it takes half a minute to read in such an MX file, it is noticeable. Things do *work*, it just takes more than a day of CPU time for everything (there is an outer optimization loop), so I was thinking about how to speed things up. – Rolf Mertig Dec 21 '11 at 09:19
  • Could I read in a chunk of data with LibraryLink code and swap it out to disk by command? Right now I need to Get / Clear the same MX file multiple times and basically I want to speed that up. – Rolf Mertig Dec 21 '11 at 09:23
  • I have never done this, so I am a little cautious but I think this should be possible. –  Dec 21 '11 at 09:54
  • Before coding in C, though, I'd make sure that the file reading is the bottleneck; perhaps the optimization can be tweaked as well. –  Dec 21 '11 at 09:55
  • @ruebenko Thanks for sharing that this is possible at all! I didn't know we could write custom importers. – Szabolcs Dec 21 '11 at 09:59
  • @Szabolcs, yes, this is important given the millions of formats that exist, so that import/export can be extended. –  Dec 21 '11 at 10:09