Imported *.mat file ends up "flat" in Python

Question

I am importing a *.mat file into Python via a script that I found on Stack Overflow.

import h5py

def read_matlab(filename):
    """
    Import *.mat-file.
    
    Source: https://stackoverflow.com/a/58026181/5696601
    """
    print(f"Importing '{filename}' ...")
    
    def conv(path=''):
        p = path or '/'
        paths[p] = ret = {}
        for k, v in f[p].items():
            if type(v).__name__ == 'Group':
                ret[k] = conv(f'{path}/{k}')  # Nested struct
                continue
            v = v[()]  # It's a Numpy array now
            if v.dtype == 'object':
                # HDF5ObjectReferences are converted
                # into a list of actual pointers
                ret[k] = (
                    [r and paths.get(f[r].name, f[r].name) for r in v.flat]
                    )
            else:
                # Matrices and other numeric arrays
                ret[k] = v if v.ndim < 2 else v.swapaxes(-1, -2)
        return ret

    paths = {}
    with h5py.File(filename, 'r') as f:
        return conv()
    
file = read_matlab("test.mat")

I know that the matrix contained in test.mat has the shape (1134, 30807). However, file is a dictionary that contains another dictionary with three keys:

file["Y_RMRIO"].keys()
Out[5]: dict_keys(['data', 'ir', 'jc'])

The shapes of the three arrays stored under these keys are as follows:

file["Y_RMRIO"]["data"].shape
Out[11]: (22037784,)

file["Y_RMRIO"]["ir"].shape
Out[12]: (22037784,)

file["Y_RMRIO"]["jc"].shape
Out[13]: (1135,)

How can I import the *.mat file and maintain the matrix's shape of (1134,30807) or turn the imported data into the shape again (e.g. np.array or pd.DataFrame)?

If I get it right, at least one of the dictionaries contains information on the "position" of the data points in the matrix. So I guess the data points could be inserted into an array at the right positions with zeros in-between (or into a np.zeros array with the right dimension). The array could then be reshaped into the desired shape ... ?

Any help is welcome. Many thanks in advance!

Do you know what is in the MATLAB workspace? Matrix, cell, struct, other special classes? — hpaulj, Feb 22 '22 at 22:52

score 1 · Accepted Answer · answered Feb 22 '22 at 23:08

This file looks a lot simpler than I expected:

In [1]: import h5py
In [2]: f = h5py.File("../Downloads/test.mat")
In [3]: f.keys()
Out[3]: <KeysViewHDF5 ['Y_RMRIO']>
In [4]: f["Y_RMRIO"]
Out[4]: <HDF5 group "/Y_RMRIO" (3 members)>
In [5]: f["Y_RMRIO"].keys()
Out[5]: <KeysViewHDF5 ['data', 'ir', 'jc']>

The dtypes are simple (not object):

In [7]: f["Y_RMRIO/data"]
Out[7]: <HDF5 dataset "data": shape (22037784,), type "<f8">
In [8]: f["Y_RMRIO/ir"]
Out[8]: <HDF5 dataset "ir": shape (22037784,), type "<u8">
In [9]: f["Y_RMRIO/jc"]
Out[9]: <HDF5 dataset "jc": shape (1135,), type "<u8">

Sampling the first ten values of each dataset:

In [10]: f["Y_RMRIO/data"][:10]
Out[10]: 
array([4.21597593e+01, 1.35612280e+02, 9.33348907e+02, 4.96704718e+01,
       8.64967748e-01, 1.23079072e+00, 6.43015281e+01, 1.49868605e+01,
       3.12984149e+02, 2.01720297e+01])
In [11]: f["Y_RMRIO/ir"][:10]
Out[11]: array([ 1,  2,  3,  4,  6,  7,  8,  9, 10, 11], dtype=uint64)
In [13]: f["Y_RMRIO/jc"][:10]
Out[13]: 
array([     0,  25021,  46743,  69537,  92648, 117807, 117807, 143254,
       165303, 189014], dtype=uint64)

I wonder if ir and jc are row and column indices of a sparse matrix:

In [15]: f["Y_RMRIO/ir"][:].max()
Out[15]: 30806
In [16]: f["Y_RMRIO/jc"][:].max()
Out[16]: 22037784

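I think jc is the indptr attribute and ir is the indices attribute of a CSC (compressed sparse column) format sparse matrix. That fits the numbers: jc has 1135 entries, its last value (22037784) equals the number of stored values in data, and ir holds one row index per stored value.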
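To make the roles of the three arrays concrete, here is a minimal sketch on a toy 3x2 matrix. The values and the names `data`, `indices`, `indptr`, and `toy` are made up for illustration and are not taken from test.mat; only `scipy.sparse.csc_matrix` and its `(data, indices, indptr)` constructor form are real.

```python
import numpy as np
from scipy import sparse

# Toy 3x2 matrix, stored column by column:
#   column 0 = [1, 0, 2], column 1 = [0, 3, 0]
data = np.array([1.0, 2.0, 3.0])  # the nonzero values (plays the role of "data")
indices = np.array([0, 2, 1])     # row index of each value (plays the role of "ir")
indptr = np.array([0, 2, 3])      # column j owns data[indptr[j]:indptr[j + 1]] ("jc")

toy = sparse.csc_matrix((data, indices, indptr), shape=(3, 2))
dense = toy.toarray()
print(dense)
```

Note that `indptr` has one more entry than the matrix has columns, and its last entry is the number of stored values. That is the same pattern as a jc of length 1135 for 1134 columns.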

In [17]: from scipy import sparse
In [18]: M = sparse.csc_matrix((f["Y_RMRIO/data"], f["Y_RMRIO/ir"], f["Y_RMRIO/jc"]))
In [19]: M
Out[19]: 
<30807x1134 sparse matrix of type '<class 'numpy.float64'>'
    with 22037784 stored elements in Compressed Sparse Column format>
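The construction above can be wrapped in a small helper. This is only a sketch: `sparse_from_group` and its `n_rows` parameter are hypothetical names, and the helper assumes the group holds exactly the three datasets shown above. When no shape is given, the row count can only be inferred from the largest row index, so trailing all-zero rows would be lost; passing `n_rows` explicitly avoids that. The toy dictionary stands in for `f["Y_RMRIO"]` so that the example runs without the file.

```python
import numpy as np
from scipy import sparse

def sparse_from_group(group, n_rows=None):
    """Build a CSC matrix from a mapping with 'data', 'ir' and 'jc' entries.

    `group` may be an h5py group or a plain dict of NumPy arrays.
    """
    data = np.asarray(group["data"][()])
    ir = np.asarray(group["ir"][()]).astype(np.int64)  # row indices
    jc = np.asarray(group["jc"][()]).astype(np.int64)  # column pointers
    n_cols = len(jc) - 1
    if n_rows is None:
        # Assumption: the last row is not entirely zero,
        # so the largest row index determines the row count.
        n_rows = int(ir.max()) + 1 if ir.size else 0
    return sparse.csc_matrix((data, ir, jc), shape=(n_rows, n_cols))

# Toy stand-in for f["Y_RMRIO"] (made-up values, uint64 like the real file)
toy_group = {
    "data": np.array([1.0, 2.0, 3.0]),
    "ir": np.array([0, 2, 1], dtype=np.uint64),
    "jc": np.array([0, 2, 3], dtype=np.uint64),
}
M_toy = sparse_from_group(toy_group, n_rows=4)
print(M_toy.shape)    # rows x columns as stored
print(M_toy.T.shape)  # transposed orientation
```

With the real file, the call would be `sparse_from_group(f["Y_RMRIO"])` inside a `with h5py.File("test.mat", "r") as f:` block. The result has 30807 rows and 1134 columns, so if the (1134, 30807) orientation from the question is wanted, `M.T` gives it. A dense float64 copy via `M.toarray()` takes 30807 x 1134 x 8 bytes, roughly 280 MB.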

Fantastic, thank you! Also thanks for not only providing me with the right answer but also how you got there! M can be turned into a `np.array` via `M.toarray()`. Thanks! — Stücke, Feb 23 '22 at 06:36
