Here's an idea:
You said you have a ragged matrix, i.e. a list of lists of different lengths. I'm assuming floating point numbers.
You could flatten the matrix to get a single long packed 1D array (use Developer`ToPackedArray to pack it if necessary), and store the starting indexes of the sublists separately. Then reconstruct the ragged matrix after the data has been imported.
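As a minimal sketch of that round trip (the tiny `ragged` list and the names `flat`, `starts` and `rebuilt` are made up for illustration), the flattening and reconstruction could look something like this:

```mathematica
(* a small ragged list of reals, for illustration only *)
ragged = {{1., 2., 3.}, {4.}, {5., 6.}};

(* flatten into one packed 1D array *)
flat = Developer`ToPackedArray[Flatten[ragged]];

(* starting index of every sublist, plus one position past the end *)
starts = Accumulate[Prepend[Length /@ ragged, 1]];

(* rebuild the sublists from consecutive pairs of starting indexes *)
rebuilt = flat[[#1 ;; (#2 - 1)]] & @@@ Partition[starts, 2, 1];

rebuilt === ragged
```

Only `flat` and `starts` need to be stored on disk; both are flat arrays of machine numbers, so both can stay packed.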
Here's a demonstration that within Mathematica (i.e. after import), extracting the sublists from a big flattened list is fast.
data = RandomReal[1, 10000000];
indexes = Union@RandomInteger[{1, 10000000}, 10000];
ranges = #1 ;; (#2 - 1) & @@@ Partition[indexes, 2, 1];
data[[#]] & /@ ranges; // Timing
{0.093, Null}
Alternatively, store a sequence of sublist lengths and use Mr.Wizard's dynamicPartition function, which does exactly this. My point is that storing the data in a flat format and partitioning it in-kernel is going to add negligible overhead.
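For illustration, here is a rough stand-in for a length-based partition. The helper name `partitionByLengths` is hypothetical and this is not Mr.Wizard's implementation, just a simple sketch of the same idea using cumulative sums of the lengths:

```mathematica
(* hypothetical helper: split a flat list into sublists of the given lengths *)
partitionByLengths[flat_List, lens_List] :=
  With[{ends = Accumulate[lens]},
    (* each sublist runs from (end - length + 1) to end *)
    MapThread[flat[[#1 ;; #2]] &, {ends - lens + 1, ends}]]

partitionByLengths[{1., 2., 3., 4., 5., 6.}, {3, 1, 2}]
(* {{1., 2., 3.}, {4.}, {5., 6.}} *)
```

In recent versions of Mathematica the built-in TakeList should do the same job, if it is available to you.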
Importing packed arrays as MX files is very fast. I only have 2 GB of memory, so I cannot test on very large files, but on my machine the import times are always a fraction of a second for packed arrays. This avoids the problem that importing unpacked data can be slower (although, as I said in the comments on the main question, I cannot reproduce the kind of extreme slowness you mention).
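A small sketch of the MX round trip, assuming DumpSave is used to write the file (the file name and the variable `flat` are only placeholders):

```mathematica
(* RandomReal returns a packed array *)
flat = RandomReal[1, 10^6];

(* write the definition of flat to an MX file in the temporary directory *)
file = FileNameJoin[{$TemporaryDirectory, "flat.mx"}];
DumpSave[file, flat];

(* forget the value, then restore it from disk *)
Clear[flat];
Get[file];

(* check whether the restored array is packed, and its length *)
{Developer`PackedArrayQ[flat], Length[flat]}
```

Keep in mind that MX files are tied to the platform they were written on, so this is only suitable when the data is written and read on the same kind of system.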
If BinaryReadList were fast (it isn't as fast as reading MX files now, but it looks like it will be significantly sped up in Mathematica 9), you could store the whole dataset as one big binary file, without needing to break it into separate MX files. Then you could import the relevant parts of the file like this:
First make a test file:
In[3]:= f = OpenWrite["test.bin", BinaryFormat -> True]
In[4]:= BinaryWrite[f, RandomReal[1, 80000000], "Real64"]; // Timing
Out[4]= {9.547, Null}
In[5]:= Close[f]
Open it:
In[6]:= f = OpenRead["test.bin", BinaryFormat -> True]
In[7]:= StreamPosition[f]
Out[7]= 0
Skip the first 5 million entries:
In[8]:= SetStreamPosition[f, 5000000*8]
Out[8]= 40000000
Read 5 million entries:
In[9]:= BinaryReadList[f, "Real64", 5000000] // Length // Timing
Out[9]= {0.609, 5000000}
Read all the remaining entries:
In[10]:= BinaryReadList[f, "Real64"] // Length // Timing
Out[10]= {7.782, 70000000}
In[11]:= Close[f]
(For comparison, Get usually reads the same data from an MX file in less than 1.5 seconds here. I am on WinXP, by the way.)
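The seek-and-read steps above could be wrapped up roughly as follows. The helper `readSlice` is a hypothetical name, and it assumes the file holds nothing but consecutive "Real64" values (8 bytes each), written with the same byte ordering they are read with:

```mathematica
(* hypothetical helper: read n reals starting at 1-based element index start *)
readSlice[file_String, start_Integer, n_Integer] :=
  Module[{f = OpenRead[file, BinaryFormat -> True], result},
    (* stream positions are 0-based byte offsets; each Real64 takes 8 bytes *)
    SetStreamPosition[f, 8 (start - 1)];
    result = BinaryReadList[f, "Real64", n];
    Close[f];
    result]

(* small self-contained check using a temporary file *)
file = FileNameJoin[{$TemporaryDirectory, "slice-test.bin"}];
out = OpenWrite[file, BinaryFormat -> True];
BinaryWrite[out, N[Range[10]], "Real64"];
Close[out];

readSlice[file, 4, 3]
(* {4., 5., 6.} *)
```

Combined with a stored list of starting indexes and lengths, this would let you pull a single sublist out of the big file without reading everything before it.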
EDIT If you are willing to spend time on this and write some C code, another idea is to create a library function (using LibraryLink) that memory-maps the file (link for Windows) and copies it directly into an MTensor object (an MTensor is just a packed Mathematica array, as seen from the C side of LibraryLink).