
I have a very large .mat file (~1.3 GB) that I am trying to load in my Python code (IPython notebook). I tried:

import scipy.io as sio
very_large = sio.loadmat('very_large.mat')

My laptop, which has 8 GB of RAM, hangs. I kept the system monitor open and saw memory consumption climb steadily to 7 GB, at which point the system froze.

What am I doing wrong? Any suggestions or workarounds?

EDIT:

More details on the data, which is available here: http://ufldl.stanford.edu/housenumbers/

The file I am interested in is extra_32x32.mat. From the description: Loading the .mat files creates 2 variables: X which is a 4-D matrix containing the images, and y which is a vector of class labels. To access the images, X(:,:,:,i) gives the i-th 32-by-32 RGB image, with class label y(i).

So for example a smaller .mat file from the same page (test_32x32.mat) when loaded in the following way:

import numpy as np

SVHN_full_test_data = sio.loadmat('test_32x32.mat')
print("\nData set = SVHN_full_test_data")
# Report the type of every entry, then its shape (arrays) or content (metadata).
for key, value in SVHN_full_test_data.items():
    print("Type of", key, ":", type(value))
    if isinstance(value, np.ndarray):
        print("Shape of", key, ":", value.shape)
    else:
        print("Content:", value)

produces:

Data set = SVHN_full_test_data
Type of y : <type 'numpy.ndarray'>
Shape of y : (26032, 1)
Type of X : <type 'numpy.ndarray'>
Shape of X : (32, 32, 3, 26032)
Type of __version__ : <type 'str'>
Content: 1.0
Type of __header__ : <type 'str'>
Content: MATLAB 5.0 MAT-file, Platform: GLNXA64, Created on: Mon Dec  5 21:18:15 2011
Type of __globals__ : <type 'list'>
Content: []
user42388
  • You are running out of memory. There isn't much you can do about it, except expanding your memory beyond 8 GB or shrinking the file in some way. – Kevin K. Aug 25 '16 at 19:39
  • Do you need *everything* from the `mat` file? If not, try to load only the required variables by specifying `variable_names = ['varname1', 'varname2']`. – Jørgen Aug 25 '16 at 19:57
  • @Jørgen I do need all the variables (columns), but not all the rows. So maybe there is a way to select only the first few rows? – user42388 Aug 25 '16 at 20:07
  • Your mention of variables, columns and rows doesn't make sense. I'd suggest giving us a description of the file contents, as seen by MATLAB: variable names, type (matrix, cell, struct) and sizes. You may also need to reread the `loadmat` documentation. – hpaulj Aug 25 '16 at 20:10
  • @hpaulj Added details – user42388 Aug 25 '16 at 20:24
  • It may be instructive for all of us if you tried to load one of the smaller variables from the big file. In your sample `y` is much smaller than `X` (assuming `dtype` is the same). – hpaulj Aug 25 '16 at 20:52
  • Matlab can potentially load small chunks of a file by creating an [interface to a mat file](http://www.mathworks.com/help/matlab/import_export/load-parts-of-variables-from-mat-files.html). You may have to write a wrapper to do so in Python though. – zglin Aug 25 '16 at 21:25
  • You can divide the file into small parts, e.g. 100 MB each, by reading it and writing to other mat files. – M.Hassan Aug 26 '16 at 00:50
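
A minimal sketch of two suggestions from the comments, assuming the file is in a format that `scipy.io` can read: `sio.whosmat` lists each variable's name, shape and MATLAB class without loading the data, and `sio.loadmat(..., variable_names=...)` loads only the named variables. The helper names `summarize_mat` and `load_only` and the byte-size table are illustrative, not part of SciPy.

```python
import numpy as np
import scipy.io as sio

# Bytes per element for common numeric MATLAB classes (illustrative table).
MATLAB_ITEM_BYTES = {
    "double": 8, "single": 4,
    "int8": 1, "uint8": 1, "int16": 2, "uint16": 2,
    "int32": 4, "uint32": 4, "int64": 8, "uint64": 8,
}


def summarize_mat(path):
    """Return (name, shape, matlab_class, estimated_bytes) per variable.

    Only the variable headers are inspected; the array data is not loaded.
    estimated_bytes is None for classes missing from the table above.
    """
    rows = []
    for name, shape, matlab_class in sio.whosmat(path):
        item_bytes = MATLAB_ITEM_BYTES.get(matlab_class)
        n_bytes = None if item_bytes is None else int(np.prod(shape)) * item_bytes
        rows.append((name, tuple(shape), matlab_class, n_bytes))
    return rows


def load_only(path, names):
    """Load just the named variables and drop the __header__-style entries."""
    data = sio.loadmat(path, variable_names=names)
    return {k: v for k, v in data.items() if not k.startswith("__")}
```

For example, `load_only('extra_32x32.mat', ['y'])` would pull in just the labels. For the `X` shape printed in the question, (32, 32, 3, 26032), the estimate would be 32 × 32 × 3 × 26032 = 79,970,304 bytes if the images are stored as `uint8` (the dtype is an assumption, since the question does not show it). Note that this selects whole variables only; it does not by itself read part of `X`.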

1 Answer


This answer is dependent on two assumptions:

  • The .mat file is saved as MAT version 7.3 (which appears to be HDF5-compliant, although MathWorks doesn't go as far as guaranteeing it), or could be written directly to HDF5 format (e.g. with MATLAB's h5write()).

  • You're able to import and use other third-party packages in Python, namely pandas.

Approach

Given those assumptions, the approach I'd use is:

  1. Ensure the .mat file is saved in an HDF5-compatible form. The header printed in the question reads "MATLAB 5.0 MAT-file", so the SVHN files would need converting first. This might mean converting with MATLAB's matfile(), which won't load the whole file into memory, or it could be done once on a machine with more RAM.

  2. Use pandas to read part of the HDF5-compliant .mat file into a data frame.

  3. Use the data frame for your onward analysis in Python.
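
As a rough sketch of step 2, here is how a slice of a dataset can be read from an HDF5 file without loading the rest. Note the swap: this uses h5py rather than pandas, because pandas' HDF5 reader is built on PyTables and may not open an arbitrary HDF5 file such as a v7.3 MAT file directly. The dataset name `"X"` and the function name are assumptions, as is the layout: v7.3 files are generally seen from Python with the dimension order reversed, so the image index is assumed to be the leading axis.

```python
import h5py
import numpy as np


def read_image_block(path, dataset="X", start=0, stop=1000):
    """Read rows [start, stop) along the leading axis of one HDF5 dataset.

    Slicing an h5py dataset reads only the requested part from disk,
    so memory use is bounded by the size of the block, not the file.
    """
    with h5py.File(path, "r") as f:
        dset = f[dataset]              # a handle only; no data is read yet
        block = dset[start:stop]       # reads just this slice into a NumPy array
    return np.asarray(block)
```

Calling `read_image_block('very_large_v73.mat', 'X', 0, 1000)` (a hypothetical converted file) would then return the first 1000 entries along the leading axis, and the call can be repeated with other ranges to work through the file in chunks.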

Notes:

Pandas data frames work very well with numpy and scipy in general. So if you can read your data into a frame, you'll probably be able to do what you want with it from there.

The answer to this SO question shows you how to read only part of an hdf5 datafile into memory (a pandas data frame) at a time, based on a condition (index range, or some logical condition e.g. WHERE something=somethingelse).
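
To illustrate the kind of partial read described above, here is a small sketch using `pandas.read_hdf` with `start`/`stop` (an index range) and `where` (a condition). It assumes PyTables is installed and that the store was written by pandas itself in `table` format; whether a MATLAB-written file can be read this way is not verified here. The key `"samples"`, the column names and the helper functions are made up for the example.

```python
import numpy as np
import pandas as pd


def write_demo_store(path, n_rows=1000):
    """Write a small table-format HDF5 store that supports partial reads."""
    df = pd.DataFrame({
        "label": np.arange(n_rows) % 10,
        "value": np.arange(n_rows, dtype=float),
    })
    # format="table" enables start/stop and where; data_columns makes
    # "label" usable in where-conditions.
    df.to_hdf(path, key="samples", mode="w", format="table", data_columns=["label"])
    return df


def read_row_range(path, start, stop):
    """Read only rows [start, stop) from the store."""
    return pd.read_hdf(path, key="samples", start=start, stop=stop)


def read_where_label(path, label):
    """Read only the rows whose 'label' column equals the given value."""
    return pd.read_hdf(path, key="samples", where="label == {}".format(int(label)))
```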

Mini-rant

MATLAB has supported its latest version 7.3 MAT files for 12 years now, but still doesn't use that as the default version to save to (it's a disk space thing: v7.3 files are larger in some situations, but way more versatile to use), so anyone using default MATLAB settings won't be generating v7.3 MAT files. 12 years on, we have loads of disk space, but this kind of thing still causes problems. It's time to upgrade your default flag, MathWorks!!!!

Hope that helps,

Tom

thclark