
I work with Mathematica 8.0.1.0 on a Windows 7 32-bit platform. I am trying to import data with

Import[file, "Table"]

which works fine as long as the file (the array in the file) is small enough. But for a bigger file (38 MB) / array (9429 × 2052) I get the message:

No more memory available. Mathematica kernel has shut down. Try quitting other applications and then retry.

On my Windows 7 64-bit platform with more main memory I can import bigger files, but I think I will eventually run into the same problem there once the file has grown / the array has more rows.

So I am trying to find a solution for importing big files. After searching for some time, I found a similar question here: Way to deal with large data files in Wolfram Mathematica. But it seems that my Mathematica knowledge is not good enough to adapt the suggested OpenRead, ReadList or similar to my data (see the example file here). The problem is that for the rest of my program I need information about the array in the file, such as Dimensions, Max/Min of some columns and rows, and I am doing operations on some columns and on every row. But when I use e.g. ReadList, I never get the same information about the array as I get with Import (probably because I am doing it the wrong way).
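
To make this concrete, the kind of information and operations I need from the imported array look roughly like this (purely illustrative; data stands for the imported array and someOperation for whatever I actually do with a row):

data = Import[file, "Table"];
dims = Dimensions[data];              (* e.g. {9429, 2052} *)
colMax = Max[data[[All, 2]]];         (* maximum of some column *)
rowMin = Min[data[[1]]];              (* minimum of some row *)
processed = someOperation /@ data;    (* an operation applied to every row *)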

Could somebody here give me some advice? I would appreciate every support!

partial81
  • Are you really saying you get this error when reading a file that is only 38 MB? – Nasser Sep 23 '11 at 08:44
  • @Nasser This is a special compact format for tabular data, sort of a binary format. It is no surprise that, when unpacked into Mathematica's symbolic data structures (lists), it needs much more memory. However, this does not explain such dramatic memory requirements either - see my answer below. – Leonid Shifrin Sep 23 '11 at 09:50
  • I had no problem whatsoever importing this test file. The ByteCount of the data once read in is 619,900,248, so about 620 MB. That may cause your problems. – Sjoerd C. de Vries Sep 23 '11 at 09:50
  • @Sjoerd I did have problems - at some point it requested more than 2GB of RAM, which you can see by using a system monitor. Since I am on a 64 bit system with 6Gb of RAM, this wasn't fatal. But I have no problem believing that this can be a killer for a 32 bit system running some other stuff as well. IMO, this is a definite flaw in the current `Import` implementation - see my answer for an alternative. As to the `ByteCount`, it is also deceiving, since the final table actually occupies about 3 times less memory than indicated by `ByteCount`. – Leonid Shifrin Sep 23 '11 at 09:56
  • @Leonid You're right, there's indeed a huge peak in memory usage during the `Import` process. – Sjoerd C. de Vries Sep 23 '11 at 10:10
  • @Leonid Is there a way to get the *real* amount of memory taken by some *Mathematica* expression? I have experimented a bit with `ByteCount` and found that this function gives confusing results. – Alexey Popkov Sep 23 '11 at 12:49
  • @Alexey I think the explanation for the case at hand is that `ByteCount` assumes that integers take `32` bytes, whereas the system is able to detect that the numbers are small and store them as `8`-byte (machine-size?) integers instead (this probably happens when we use `ImportString` with the `"Table"` format). Therefore, instead of reporting about `160Mb` for the table, it reports about `640Mb` (the table is sparse and vastly dominated by zeros, which are integers). This certainly looks like a major inconsistency. And I don't know of a better method to measure expression memory either. – Leonid Shifrin Sep 23 '11 at 15:39

1 Answer


For some reason, the current implementation of Import for the type Table (tabular data) is quite memory-inefficient. Below I've made an attempt to remedy this situation somewhat, while still reusing Mathematica's high-level importing capabilities (through ImportString). For sparse tables, a separate solution is presented, which can lead to very significant memory savings.

General memory-efficient solution

Here is a much more memory-efficient function:

Clear[readTable];
readTable[file_String?FileExistsQ, chunkSize_: 100] :=
   Module[{str, stream, dataChunk, result, linkedList, add},
      SetAttributes[linkedList, HoldAllComplete];
      add[ll_, value_] := linkedList[ll, value];           
      stream  = StringToStream[Import[file, "String"]];
      Internal`WithLocalSettings[
         Null,
         (* main code *)
         result = linkedList[];
         While[dataChunk =!= {},
           dataChunk = 
              ImportString[
                 StringJoin[Riffle[ReadList[stream, "String", chunkSize], "\n"]], 
                 "Table"];
           result = add[result, dataChunk];
         ];
         result = Flatten[result, Infinity, linkedList],
         (* clean-up *)
         Close[stream]
      ];
      Join @@ result]
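
A side note on the accumulation scheme used above: each new chunk is wrapped into a nested linkedList expression (which avoids the quadratic cost of repeatedly appending to a growing list), and a single Flatten with linkedList as the third argument unwraps the nesting at the end, after which Join glues the chunks together. A minimal sketch of this idiom (separate from readTable; acc and the small sample chunks are just illustrative):

SetAttributes[linkedList, HoldAllComplete];
add[ll_, value_] := linkedList[ll, value];
acc = linkedList[];                          (* empty accumulator *)
acc = add[acc, {{1, 2}, {3, 4}}];            (* first chunk of rows *)
acc = add[acc, {{5, 6}}];                    (* second chunk *)
Join @@ Flatten[acc, Infinity, linkedList]   (* {{1, 2}, {3, 4}, {5, 6}} *)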

Here is how it compares with the standard Import, for your file:

In[3]:= used = MaxMemoryUsed[]
Out[3]= 18009752

In[4]:= 
tt = readTable["C:\\Users\\Archie\\Downloads\\ExampleFile\\ExampleFile.txt"];//Timing
Out[4]= {34.367,Null}

In[5]:= used = MaxMemoryUsed[]-used
Out[5]= 228975672

In[6]:= 
t = Import["C:\\Users\\Archie\\Downloads\\ExampleFile\\ExampleFile.txt","Table"];//Timing
Out[6]= {25.615,Null}

In[7]:= used = MaxMemoryUsed[]-used
Out[7]= 2187743192

In[8]:= tt===t
Out[8]= True

You can see that my code is about 10 times more memory-efficient than Import, while being not much slower. You can control the memory consumption by adjusting the chunkSize parameter. Your resulting table occupies about 150-200 MB of RAM.
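
Since readTable returns an ordinary nested list, the quantities mentioned in the question are obtained in the usual way; for instance, something along these lines (a sketch, reusing the table tt imported above and the optional chunkSize argument):

tt = readTable["C:\\Users\\Archie\\Downloads\\ExampleFile\\ExampleFile.txt", 50];  (* smaller chunks should lower the peak memory, at the cost of some speed *)
Dimensions[tt]       (* {9429, 2052} *)
Max[tt[[All, 1]]]    (* maximum of the first column *)
Min[tt[[1]]]         (* minimum of the first row *)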

EDIT

Getting yet more efficient for sparse tables

I want to illustrate how one can make this function yet 2-3 times more memory-efficient during the import, plus another order of magnitude more memory-efficient in terms of the final memory occupied by your table, using SparseArray-s. The degree of memory savings depends greatly on how sparse your table is. In your example, the table is very sparse.
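
If you are not sure how sparse your data actually is, a quick check along the following lines (a sketch; it assumes the full table t has already been imported and that the entries are exact integer zeros, as in the example file) tells you what fraction of the entries is zero:

N[Count[t, 0, {2}]/(Times @@ Dimensions[t])]   (* fraction of zero entries; close to 1 means very sparse *)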

The anatomy of sparse arrays

We start with a generally useful API for construction and deconstruction of SparseArray objects:

ClearAll[spart, getIC, getJR, getSparseData, getDefaultElement, makeSparseArray];
HoldPattern[spart[SparseArray[s___], p_]] := {s}[[p]];
getIC[s_SparseArray] := spart[s, 4][[2, 1]];
getJR[s_SparseArray] := Flatten@spart[s, 4][[2, 2]];
getSparseData[s_SparseArray] := spart[s, 4][[3]];
getDefaultElement[s_SparseArray] := spart[s, 3];
makeSparseArray[dims : {_, _}, jc : {__Integer}, ir : {__Integer}, 
     data_List, defElem_: 0] :=
 SparseArray @@ {Automatic, dims, defElem, {1, {jc, List /@ ir}, data}};

Some brief comments are in order. Here is a sample sparse array:

In[15]:= 
ToHeldExpression@ToString@FullForm[sp  = SparseArray[{{0,0,1,0,2},{3,0,0,0,4},{0,5,0,6,7}}]]

Out[15]= 
Hold[SparseArray[Automatic,{3,5},0,{1,{{0,2,4,7},{{3},{5},{1},{5},{2},{4},{5}}},
{1,2,3,4,5,6,7}}]]

(I used a ToString / ToHeldExpression cycle to convert List[...] etc. in the FullForm back to {...} for ease of reading). Here, {3,5} are obviously the dimensions. Next is 0, the default element. Next is a nested list, which we can denote as {1, {ic, jr}, sparseData}. Here, ic gives the total number of nonzero elements as we add rows - so it is first 0, then 2 after the first row, the second row adds 2 more, and the last adds 3 more. The next list, jr, gives the positions of non-zero elements in all rows, so they are 3 and 5 for the first row, 1 and 5 for the second, and 2, 4 and 5 for the last one. There is no confusion as to where which row starts and ends, since this can be determined from the ic list. Finally, we have sparseData, which is a list of the non-zero elements read row by row from left to right (the ordering is the same as for the jr list). This explains the internal format in which SparseArray-s store their elements, and hopefully clarifies the role of the functions above.
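
As a quick consistency check of this description, one can take the sample array apart with the helper functions above and reassemble it (a small illustration only, not needed for the main code):

sp = SparseArray[{{0, 0, 1, 0, 2}, {3, 0, 0, 0, 4}, {0, 5, 0, 6, 7}}];
rebuilt = makeSparseArray[Dimensions[sp], getIC[sp], getJR[sp],
     getSparseData[sp], getDefaultElement[sp]];
Normal[rebuilt] === Normal[sp]   (* True *)
getIC[sp]                        (* {0, 2, 4, 7} *)
getJR[sp]                        (* {3, 5, 1, 5, 2, 4, 5} *)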

The code

Clear[readSparseTable];
readSparseTable[file_String?FileExistsQ, chunkSize_: 100] :=
   Module[{stream, dataChunk, start, ic = {}, jr = {}, sparseData = {}, 
        getDataChunkCode, dims},
     stream  = StringToStream[Import[file, "String"]];
     getDataChunkCode := 
       If[# === {}, {}, SparseArray[#]] &@
         ImportString[
             StringJoin[Riffle[ReadList[stream, "String", chunkSize], "\n"]], 
             "Table"];
     Internal`WithLocalSettings[
        Null,
        (* main code *)
        start = getDataChunkCode;
        ic = getIC[start];
        jr = getJR[start];
        sparseData = getSparseData[start];
        dims = Dimensions[start];
        While[True,
           dataChunk = getDataChunkCode;
           If[dataChunk === {}, Break[]];
           ic = Join[ic, Rest@getIC[dataChunk] + Last@ic];
           jr = Join[jr, getJR[dataChunk]];
           sparseData = Join[sparseData, getSparseData[dataChunk]];
           dims[[1]] += First[Dimensions[dataChunk]];
        ],
        (* clean-up *)
        Close[stream]
     ];
     makeSparseArray[dims, ic, jr, sparseData]]

Benchmarks and comparisons

Here is the starting amount of used memory (fresh kernel):

In[10]:= used = MemoryInUse[]
Out[10]= 17910208

We call our function:

In[11]:= 
(tsparse= readSparseTable["C:\\Users\\Archie\\Downloads\\ExampleFile\\ExampleFile.txt"]);//Timing
Out[11]= {39.874,Null}

So, it is the same speed as readTable. How about the memory usage?

In[12]:= used = MaxMemoryUsed[]-used
Out[12]= 80863296

I think this is quite remarkable: we only ever used about twice as much memory as the file itself occupies on disk. But, even more remarkably, the final memory usage (after the computation finished) has been dramatically reduced:

In[13]:= MemoryInUse[]
Out[13]= 26924456

This is because we use a SparseArray:

In[15]:= {tsparse,ByteCount[tsparse]}
Out[15]= {SparseArray[<326766>,{9429,2052}],12103816}

So, our table takes only 12 MB of RAM. We can compare it to our more general function:

In[18]:= 
(t = readTable["C:\\Users\\Archie\\Downloads\\ExampleFile\\ExampleFile.txt"]);//Timing
Out[18]= {38.516,Null}

The results are the same once we convert our sparse table back to normal:

In[20]:= Normal@tsparse==t
Out[20]= True

while the normal table occupies vastly more space (it appears that ByteCount overcounts the occupied memory by about 3-4 times, but the real difference is still at least an order of magnitude):

In[21]:= ByteCount[t]
Out[21]= 619900248
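
As a final aside (not benchmarked here): the quantities mentioned in the question - Dimensions, Max/Min of columns and rows - can typically be obtained from the sparse table directly, without converting the whole thing back with Normal; and if some particular function does not accept a SparseArray, applying Normal to a single extracted row or column is cheap:

Dimensions[tsparse]       (* {9429, 2052} *)
Max[tsparse[[All, 2]]]    (* maximum of the second column *)
Min[tsparse[[100]]]       (* minimum of row 100 *)
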
Leonid Shifrin
  • Strange, the data structure generated by `Import` takes 619,900,216 on my PC according to `ByteCount`, but if I'm measuring using `MaxMemoryUsed` it only takes 181,224,272 bytes. I assume taking the `MaxMemoryUsed` route doesn't give you an accurate impression of the memory need of your data. I mean, with all the garbage collection being done behind your back. – Sjoerd C. de Vries Sep 23 '11 at 09:57
  • @Sjoerd Indeed. I have no idea why. Also see my comment above. – Leonid Shifrin Sep 23 '11 at 09:59
  • @Sjoerd No, I don't think so. You can also use `MemoryInUse`, as well as a system monitor (look at the memory used by the MathKernel process). All of them sort of agree, but disagree with `ByteCount`. I think the data is somehow stored more efficiently than `ByteCount` indicates. And the garbage collection is a different issue. In my experiments, `MaxMemoryUsed` is more or less accurate in giving you the maximal size of allocated memory, including that. – Leonid Shifrin Sep 23 '11 at 10:01
  • Indeed, `Table` is both slow and extremely memory-inefficient. For homogeneous data (i.e. the same data type for each table entry) I usually use `ReadList`. It's much, much faster as well as more memory-efficient. – Szabolcs Sep 23 '11 at 11:11
  • @Sjoerd Regarding the discrepancy between `ByteCount` and `MemoryInUse` / system monitor, please see my reply to the comment by Alexey above, for a possible explanation. – Leonid Shifrin Sep 23 '11 at 15:40
  • Since the use of your memory efficient alternatives are likely to be long running (lots of stuff to load), might I suggest wrapping them in `CheckAbort` to ensure that the `Stream` is closed properly if the operation is aborted? [Here's an implementation](http://stackoverflow.com/questions/7525782/import-big-files-arrays-with-mathematica) of what I use to make the functionality easier to add. – rcollyer Sep 23 '11 at 16:38
  • @rcollyer Thanks, this crossed my mind. I did not include it here partly to keep things simpler, but this is a great idea. Let me see if we can do it unintrusively with the help of macros. By the way, I think you used the wrong link - to this very page - in your comment. – Leonid Shifrin Sep 23 '11 at 16:43
  • @Leonid, that's why I use `OpenAndRead`. It only requires a single extra line. – rcollyer Sep 23 '11 at 16:45
  • I'm sorry, I hadn't realized that I had linked back to this question. Here's the [correct link](http://stackoverflow.com/questions/4174791/preventing-avalanche-of-runtime-errors-in-mathematica/4176381#4176381). – rcollyer Sep 23 '11 at 16:47
  • @rcollyer Actually, it is not enough to protect against aborts. One has to also protect against exceptions, which must then be re-thrown after the code finishes. So, it is more complicated. Fortunately for us, this function (macro) already exists: it is ``Internal`WithLocalSettings``, and has been described by Daniel Lichtblau here: http://groups.google.com/group/comp.soft-sys.math.mathematica/browse_thread/thread/a39d99bc1470bb3c . I will use it here, per your suggestion. – Leonid Shifrin Sep 23 '11 at 16:51
  • True, but I have so rarely encountered exceptions that it didn't seem to matter. – rcollyer Sep 23 '11 at 16:55
  • @rcollyer Done now - should be safer. – Leonid Shifrin Sep 23 '11 at 17:12
  • @Szabolcs Yes, sure, this is how I understood your "Table". I totally agree with both your comment and suggestion. I actually used `ReadList` here as well. – Leonid Shifrin Sep 23 '11 at 17:18
  • `WithLocalSettings` works here, but it has limitations in general use. It doesn't, for example, handle `Throw` and `Catch` properly (e.g. ``Catch[Internal`WithLocalSettings[Print["start"], Throw[23], Print["cleanup"]]]``). The clean-up happens, but the "catch" is disrupted. Handling stack unwinds in MMa is a tricky business. See [Reliable clean-up in Mathematica](http://stackoverflow.com/q/3365794/211232). – WReach Sep 23 '11 at 21:56
  • @WReach Ok, good point. I have in my toolset a function similar to your `Cleanup` - just did not want to add more code, things are complex enough here already. Will either use yours, and link, or use mine and post the code. The morale is that everything should be checked - I took the ``Internal`WithLocalSettings `` correctness for granted and did not test extensively - and got in trouble. Now I see why I subconsciously did not want to add this part in a hurry - I knew it should be done thoroughly. – Leonid Shifrin Sep 23 '11 at 22:20
  • Wow, I liked your old answer a lot, but the new, edited answer is really impressive! I also recognized that most of my arrays are very sparse, but I was not able to make use of this because my programming skills are rather low at the moment. Therefore, I also have/had some problems understanding your old/new complex solution (thanks a lot for the comments, they are really helpful!). But I am very happy that it works fine on computers with little RAM (I think Mathematica should reimplement its Import function for tables along the lines of your solutions). – partial81 Sep 26 '11 at 07:07
  • I also want to say thanks to Sjoerd, Szabolcs, rcollyer, and WReach for their fruitful comments! Thanks goes also to @Mr.Wizard for correcting my English mistakes. – partial81 Sep 26 '11 at 07:10
  • @partial81 The basic idea is to read the file in chunks, transform those chunks into parts of the table, and accumulate them, gluing them together at the end. For the second solution, table chunks are converted to `SparseArray` objects. To keep things efficient, we go more low-level there, and merge individual constituents of the `SparseArray`. Luckily, the structure of the problem makes this straightforward. The main point of both solutions is to avoid loading the entire file into memory and converting it to a full table at once - since this is what seems to cause the huge memory consumption. – Leonid Shifrin Sep 26 '11 at 07:43