
I am trying to create .mat data files using python. The matlab code expects the data to have a certain format, where two-dimensional ndarrays of non-uniform sizes are stored as objects in a column vector. So, in my case, there would be k numpy arrays of shape (m_i, n) - with different m_i for each array - stored in a numpy array with dtype=object of shape (k, 1). I then add this object array to a dictionary and pass it to scipy.io.savemat().

This works fine so long as the m_i are indeed different. If all k arrays happen to have the same number of rows m_i, the behaviour becomes strange. First of all, it requires very explicit assignment to a numpy array of dtype=object that has been initialised to the final size k, otherwise numpy simply creates a three-dimensional array. But even when I have the correct format in python and store it to a .mat file using savemat, there is some kind of problem in the translation to the matlab format.

When I reload the data from the .mat file using scipy.io.loadmat, I find that I still have an object array of shape (k, 1), which still has elements of shape (m, n). However, each element is no longer an int or a float but is instead a numpy array of shape (1, 1) that has to be further indexed to access the contained int or float. So an individual element of an object vector that was supposed to be a numpy array of shape (2, 4) would look something like this:

[array([[array([[0.82374894]]), array([[0.50730055]]),
        array([[0.36721625]]), array([[0.45036349]])],
       [array([[0.26119276]]), array([[0.16843872]]),
        array([[0.28649524]]), array([[0.64239569]])]], dtype=object)]

This also poses a problem for the matlab code that I am trying to build my data files for. It runs fine for the arrays of objects that have different shapes but will break when there are arrays containing arrays of the same shape.
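On the Python side, the nested structure shown above can be flattened again after loading. The following is only a minimal sketch: `unwrap_scalar_cells` is a hypothetical helper name, not part of NumPy or SciPy, and it assumes every element of the object array holds exactly one number.

```python
import numpy as np

def unwrap_scalar_cells(nested, dtype=float):
    """Turn an object array of (1, 1) arrays back into a plain numeric array.

    Hypothetical adapter; assumes each element contains exactly one value.
    """
    nested = np.asarray(nested, dtype=object)
    out = np.empty(nested.shape, dtype=dtype)
    for idx in np.ndindex(nested.shape):
        # .item() extracts the single value from the (1, 1) wrapper
        out[idx] = np.asarray(nested[idx]).item()
    return out

# Build a small stand-in for what loadmat returned in the question.
values = np.array([[0.5, 1.5], [2.5, 3.5]])
nested = np.empty(values.shape, dtype=object)
for idx in np.ndindex(values.shape):
    nested[idx] = np.array([[values[idx]]])

flat = unwrap_scalar_cells(nested)
print(flat.dtype, flat.shape)
```

This only repairs the data after the fact in Python; it does not change what the MATLAB code sees in the .mat file.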

I know this is a rather obscure and possibly unavoidable issue but I figured I would see if anyone else has encountered it and found a fix. Thanks.

kiliantics
  • Regardless of the shape/format of the data, what's the problem with building an "adapter" function/class that would convert the information stored in the `.mat` file to whatever the rest of the code expects? Also, consider that you could write a script that pre-processes the files (by e.g. loading them in MATLAB and saving them in the format that the rest of the code expects). The key here is _using the right tool for the job_, which might be writing a simple function that turns whatever you have into whatever you need, instead of wasting time on getting python/MATLAB to work "just right". – Dev-iL May 21 '19 at 14:32
  • Also, it might be worth tagging this with [tag:mat-file] instead of [tag:scipy] (which is the least relevant here, imho) - but it's up to you. One last thing that I think could improve your question - please provide a [mcve], and show us what is the structure you expect to get in MATLAB vs what you're actually getting. – Dev-iL May 21 '19 at 14:35
  • I find it useful to create a sample file in MATLAB (I use Octave), and then use `loadmat` to see what the numpy equivalent is. – hpaulj May 21 '19 at 15:12
  • @Dev, `scipy` is the source package for `loadmat`. It's a broad category, but I check it regularly. – hpaulj May 21 '19 at 15:22
  • @Dev-iL, I expected this to work "just right" as I assumed `savemat` was an appropriate tool and I could thereby save myself a lot more time than I would spend figuring it all out in matlab/octave, which I am unfamiliar with. @hpaulj I have sample files that work with the matlab code (that I also run with octave), and I can see their structure by loading with scipy's `loadmat`. The files I create also appear to be working fine, apart from the edge cases I mentioned where all subarrays have the same shape. – kiliantics May 21 '19 at 16:03
  • When you make an array from subarrays you can get two very different results. If the subarrays differ in shape, the result will be an object dtype array (or error in some cases). If they match in shape, `np.array` joins them into one higher dimensional array. Is that what's happening in your edge cases? – hpaulj May 21 '19 at 17:00
  • @kiliantics If I understand your problem correctly it can be solved very easily in MATLAB itself. Turning an `N`-dimensional array into a (cell) vector of `N-1`-dimensional arrays is very simple, and only requires using one command ([`num2cell`](https://www.mathworks.com/help/matlab/ref/num2cell.html)). Feel free to stop by the [MATLAB and Octave chatroom](https://chat.stackoverflow.com/rooms/81987/chatlab-and-talktave) if you want to discuss this. – Dev-iL May 22 '19 at 07:39
  • @hpaulj I could argue that `savemat` and hence `scipy` are the wrong tool to use, since they work with old `.mat` files. Newer files use HDF5, as mentioned in the [`loadmat` docs](https://docs.scipy.org/doc/scipy/reference/generated/scipy.io.loadmat.html "See the ""Notes"" section"). I'm not familiar with python enough to say that the `savemat` itself might be the main cause of the OP's problem, but if it _is_, just switching the export tool could solve several issues. If the OP cannot switch tools for whatever reason, then it's a different story. – Dev-iL May 22 '19 at 07:48
  • @Dev-iL, I'm fairly comfortable with the `h5py` interface to HDF5; there's a lot of similarity between `numpy` arrays and `HDF5` datasets. I've deciphered some Octave HDF5 files (using h5py), but I wouldn't recommend it to beginners. Also there seems to be (or have been) differences between MATLAB and Octave in their HDF5 layout. – hpaulj May 22 '19 at 08:02

1 Answer


I'm not quite clear about the problem. Let me try to recreate your case:

In [57]: import numpy as np
In [58]: from scipy.io import loadmat, savemat
In [59]: A = np.empty((2,1), object)     
In [61]: A[0,0]=np.arange(4).reshape(2,2)                                    
In [62]: A[1,0]=np.arange(6).reshape(3,2)                                    
In [63]: A                                                                   
Out[63]: 
array([[array([[0, 1],
       [2, 3]])],
       [array([[0, 1],
       [2, 3],
       [4, 5]])]], dtype=object)
In [64]: B=A[[0,0],:]                                                        
In [65]: B                                                                   
Out[65]: 
array([[array([[0, 1],
       [2, 3]])],
       [array([[0, 1],
       [2, 3]])]], dtype=object)

As I explained earlier today, creating an object dtype array from arrays of matching size requires special handling. np.array(...) tries to create a higher dimensional array. https://stackoverflow.com/a/56243305/901925
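A small sketch of that behaviour, and of one way around it. The helper name `to_cell_column` is made up for illustration; it is not a NumPy function.

```python
import numpy as np

a = np.arange(4).reshape(2, 2)

# Same-shaped inputs: NumPy stacks them, even when dtype=object is requested.
stacked = np.array([a, a], dtype=object)
print(stacked.shape)     # (2, 2, 2)
print(stacked[0].dtype)  # object: a 2-D array whose elements are Python ints

def to_cell_column(arrays):
    """Pack 2-D arrays into a (k, 1) object array, one array per row.

    Hypothetical helper; works for equal and unequal shapes alike.
    """
    cell = np.empty((len(arrays), 1), dtype=object)  # preallocate so nothing gets stacked
    for i, arr in enumerate(arrays):
        cell[i, 0] = np.asarray(arr)  # each element stays its own 2-D array
    return cell

cell = to_cell_column([a, a])
print(cell.shape, cell[0, 0].shape)
```

Note that slicing `stacked` yields 2-D arrays that are themselves `dtype=object`, which may be how number-filled object arrays end up inside the outer array.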

Saving:

In [66]: savemat('foo.mat', {'A':A, 'B':B})                                  

Loading:

In [74]: loadmat('foo.mat')                                                  
Out[74]: 
{'__header__': b'MATLAB 5.0 MAT-file Platform: posix, Created on: Tue May 21 11:20:42 2019',
 '__version__': '1.0',
 '__globals__': [],
 'A': array([[array([[0, 1],
        [2, 3]])],
        [array([[0, 1],
        [2, 3],
        [4, 5]])]], dtype=object),
 'B': array([[array([[0, 1],
        [2, 3]])],
        [array([[0, 1],
        [2, 3]])]], dtype=object)}
In [75]: _74['A'][1,0]                                                       
Out[75]: 
array([[0, 1],
       [2, 3],
       [4, 5]])

Your problem case looks like it's an object dtype array containing numbers:

In [89]: C = np.arange(4).reshape(2,2).astype(object)                        
In [90]: C                                                                   
Out[90]: 
array([[0, 1],
       [2, 3]], dtype=object)
In [91]: savemat('foo1.mat', {'C': C})                                       
In [92]: loadmat('foo1.mat')                                                 
Out[92]: 
{'__header__': b'MATLAB 5.0 MAT-file Platform: posix, Created on: Tue May 21 11:39:31 2019',
 '__version__': '1.0',
 '__globals__': [],
 'C': array([[array([[0]]), array([[1]])],
        [array([[2]]), array([[3]])]], dtype=object)}

Evidently savemat has converted the integer objects into 2d MATLAB compatible arrays. In MATLAB everything, even scalars, is at least 2d.
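If that is what is happening, casting each inner array to a numeric dtype before saving should avoid the extra nesting. A rough sketch, assuming the inner arrays hold only numbers; `io.BytesIO` is used here as an in-memory stand-in for a real .mat file:

```python
import io

import numpy as np
from scipy.io import loadmat, savemat

C_obj = np.arange(4).reshape(2, 2).astype(object)  # the problem case
C_num = C_obj.astype(float)                        # plain numeric matrix

buf = io.BytesIO()  # in-memory stand-in for a .mat file
savemat(buf, {'C_obj': C_obj, 'C_num': C_num})
buf.seek(0)
loaded = loadmat(buf)

# The object version comes back as a cell of (1, 1) arrays,
# the numeric version as an ordinary 2-D matrix.
print(loaded['C_obj'].dtype, loaded['C_obj'][0, 1].shape)
print(loaded['C_num'].dtype, loaded['C_num'].shape)
```

For a (k, 1) object array of subarrays, the same cast would be applied to each element (for example `cell[i, 0] = cell[i, 0].astype(float)`) before calling `savemat`.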

===

And in Octave, the object dtype arrays all produce cells, and the 2d numeric arrays produce matrices:

>> load foo.mat
>> A
A =
{
  [1,1] =

    0  1
    2  3

  [2,1] =

    0  1
    2  3
    4  5

}
>> B
B =
{
  [1,1] =

    0  1
    2  3

  [2,1] =

    0  1
    2  3

}
>> load foo1.mat
>> C
C =
{
  [1,1] = 0
  [2,1] = 2
  [1,2] = 1
  [2,2] = 3
}

The relatively recent SO question "Python: Issue reading in str from MATLAB .mat file using h5py and NumPy" showed that there's a difference between the Octave and MATLAB HDF5 layouts.

hpaulj