
I am importing a *.mat file into Python using a script that I found on Stack Overflow:

import h5py

def read_matlab(filename):
    """
    Import *.mat-file.
    
    Source: https://stackoverflow.com/a/58026181/5696601
    """
    print(f"Importing '{filename}' ...")
    
    def conv(path=''):
        p = path or '/'
        paths[p] = ret = {}
        for k, v in f[p].items():
            if type(v).__name__ == 'Group':
                ret[k] = conv(f'{path}/{k}')  # Nested struct
                continue
            v = v[()]  # It's a Numpy array now
            if v.dtype == 'object':
                # HDF5ObjectReferences are converted
                # into a list of actual pointers
                ret[k] = (
                    [r and paths.get(f[r].name, f[r].name) for r in v.flat]
                    )
            else:
                # Matrices and other numeric arrays
                ret[k] = v if v.ndim < 2 else v.swapaxes(-1, -2)
        return ret

    paths = {}
    with h5py.File(filename, 'r') as f:
        return conv()
    
file = read_matlab("test.mat")

I know that the matrix stored in test.mat has the shape (1134, 30807). However, file is a dictionary that contains another dictionary with three keys:

file["Y_RMRIO"].keys()
Out[5]: dict_keys(['data', 'ir', 'jc'])

The shapes of the three arrays are as follows:

file["Y_RMRIO"]["data"].shape
Out[11]: (22037784,)

file["Y_RMRIO"]["ir"].shape
Out[12]: (22037784,)

file["Y_RMRIO"]["jc"].shape
Out[13]: (1135,)

How can I import the *.mat file so that the matrix keeps its shape of (1134, 30807), or convert the imported data back into that shape (e.g. as an np.array or pd.DataFrame)?

If I understand it correctly, at least one of the arrays holds the positions of the data points in the matrix. So I guess the data points could be written to the right positions of an np.zeros array with the right shape, leaving zeros everywhere else?

Any help is welcome. Many thanks in advance!

Pranav Hosangadi
Stücke

1 Answer

This file looks a lot simpler than I expected:

In [1]: import h5py
In [2]: f = h5py.File("../Downloads/test.mat")
In [3]: f.keys()
Out[3]: <KeysViewHDF5 ['Y_RMRIO']>
In [4]: f["Y_RMRIO"]
Out[4]: <HDF5 group "/Y_RMRIO" (3 members)>
In [5]: f["Y_RMRIO"].keys()
Out[5]: <KeysViewHDF5 ['data', 'ir', 'jc']>

The dtypes are simple (not object):

In [7]: f["Y_RMRIO/data"]
Out[7]: <HDF5 dataset "data": shape (22037784,), type "<f8">
In [8]: f["Y_RMRIO/ir"]
Out[8]: <HDF5 dataset "ir": shape (22037784,), type "<u8">
In [9]: f["Y_RMRIO/jc"]
Out[9]: <HDF5 dataset "jc": shape (1135,), type "<u8">

Sampling the first few values of each dataset:

In [10]: f["Y_RMRIO/data"][:10]
Out[10]: 
array([4.21597593e+01, 1.35612280e+02, 9.33348907e+02, 4.96704718e+01,
       8.64967748e-01, 1.23079072e+00, 6.43015281e+01, 1.49868605e+01,
       3.12984149e+02, 2.01720297e+01])
In [11]: f["Y_RMRIO/ir"][:10]
Out[11]: array([ 1,  2,  3,  4,  6,  7,  8,  9, 10, 11], dtype=uint64)
In [13]: f["Y_RMRIO/jc"][:10]
Out[13]: 
array([     0,  25021,  46743,  69537,  92648, 117807, 117807, 143254,
       165303, 189014], dtype=uint64)

I wonder whether ir and jc are the row and column indices of a sparse matrix:

In [15]: f["Y_RMRIO/ir"][:].max()
Out[15]: 30806
In [16]: f["Y_RMRIO/jc"][:].max()
Out[16]: 22037784

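The maximum of ir (30806) fits row indices for 30807 rows, while jc has 1135 = 1134 + 1 entries and its maximum (22037784) equals the number of stored values. So I think jc is the indptr attribute and ir the indices attribute of a CSC-format sparse matrix: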

In [17]: from scipy import sparse
In [18]: M = sparse.csc_matrix((f["Y_RMRIO/data"], f["Y_RMRIO/ir"], f["Y_RMRIO/jc"]))
In [19]: M
Out[19]: 
<30807x1134 sparse matrix of type '<class 'numpy.float64'>'
    with 22037784 stored elements in Compressed Sparse Column format>
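
To make the role of jc and ir concrete, here is a minimal NumPy-only sketch of how the three arrays expand into a dense matrix. The helper name csc_to_dense and the tiny input arrays are made up for illustration and are not taken from test.mat. In practice, sparse.csc_matrix((data, ir, jc), shape=(n_rows, n_cols)) does the same job without the Python loop.

```python
import numpy as np

def csc_to_dense(data, ir, jc, n_rows=None):
    """Expand MATLAB-style sparse arrays (data, ir, jc) into a dense 2-D array.

    Hypothetical helper for illustration only.
    """
    data = np.asarray(data)
    ir = np.asarray(ir, dtype=np.int64)   # row index of each stored value
    jc = np.asarray(jc, dtype=np.int64)   # column pointers, length n_cols + 1
    n_cols = len(jc) - 1
    if n_rows is None:
        # Only a lower bound: trailing all-zero rows cannot be inferred
        n_rows = int(ir.max()) + 1 if ir.size else 0
    dense = np.zeros((n_rows, n_cols), dtype=data.dtype)
    for col in range(n_cols):
        # Values of column `col` live in data[jc[col]:jc[col + 1]]
        start, stop = jc[col], jc[col + 1]
        dense[ir[start:stop], col] = data[start:stop]
    return dense

# Made-up example: 3 rows, 4 columns, third column empty
data = np.array([1.0, 2.0, 3.0, 4.0])
ir = np.array([0, 2, 1, 0], dtype=np.uint64)
jc = np.array([0, 2, 3, 3, 4], dtype=np.uint64)

dense = csc_to_dense(data, ir, jc)
print(dense.shape)
print(dense)
```

Note that a repeated value in jc (as at positions 5 and 6 of the sample above) simply marks an empty column. If the (1134, 30807) orientation from the question is the one needed, transposing the result (M.T for the sparse matrix, dense.T here) should give it, assuming the matrix was stored the other way round.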
hpaulj
  • Fantastic, thank you! Also thanks for not only providing me with the right answer but also how you got there! M can be turned into a `np.array` via `M.toarray()`. Thanks! – Stücke Feb 23 '22 at 06:36