I am trying to write a dictionary to a .mat file using scipy.io.savemat(), but when I do, the contents change!

Here is the array I wish to assign to the dictionary key "Genes":

vectorizeddf.index.values.astype(np.str_)

Which prints as

array(['44M2.3', 'A0A087WSV2', 'A0A087WT57', ..., 'tert-rmrp_human',
       'tert-terc_human', 'wisp3 varinat'], 
      dtype='<U44')

Then I do

import numpy as np
import scipy.io as sio

genedict = {"Genes": vectorizeddf.index.values.astype(np.str_),
            "X": vectorizeddf.values,
            "ID": vectorizeddf.columns.values.astype(np.str_)}
sio.savemat("goa_human.mat", genedict)

But when I load the dictionary using

goadict = sio.loadmat("goa_human.mat")

My strings get padded with spaces!

>>> goadict['Genes']
array(['44M2.3                                      ',
   'A0A087WSV2                                  ',
   'A0A087WT57                                  ', ...,
   'tert-rmrp_human                             ',
   'tert-terc_human                             ',
   'wisp3 varinat                               '], 
  dtype='<U44')

Which is far from ideal. On the other hand, when I access

genedict['ID']

I get

array(['GO:0000002', 'GO:0000003', 'GO:0000009', ..., 'GO:2001303',
       'GO:2001306', 'GO:2001311'], 
     dtype='<U10')

Which is the original format of the array before saving. It seems to me that the issue is in the dtype, but I did my best to cast both of them as strings. I am not sure why one is <U44 and the other is <U10. How might I resolve this?

Thank you!

Hudson Cooper
  • Why are you showing us `genedict['ID']`? Show us `genedict['Genes']`. Compare the same key. Different keys will have different `dtypes`. – hpaulj Mar 02 '16 at 02:29
  • @hpaulj implicitly he did, with the first three blocks of code – zeeMonkeez Mar 02 '16 at 02:40
  • The issue is that in order to save cell arrays of strings, the array should be constructed with `np.asarray(['abc', 'def', 'ghi'], dtype='object')`. – zeeMonkeez Mar 02 '16 at 02:43
  • I don't think he's worried about the data structure in MATLAB (cell, structure, etc). He's just writing and reading in Python. As far as I can tell the 'Genes' value is saved as 'U44' and loaded as the same. – hpaulj Mar 02 '16 at 02:46
  • http://stackoverflow.com/questions/35706697/saving-numpy-structure-array-to-mat-file - yesterday's `savemat` question. – hpaulj Mar 02 '16 at 02:49
  • @hpaulj except he's writing a `mat` file, which imposes MATLAB's constraints, where char arrays are space padded. The correct way to save strings in `mat` is with cell arrays. – zeeMonkeez Mar 02 '16 at 03:12
  • Yes, I just went through and reinvented the answers in your link. – hpaulj Mar 02 '16 at 06:35
  • My bad! goadict['ID'] printed the same as genedict['ID']. I just pasted the wrong one. – Hudson Cooper Mar 03 '16 at 02:56

1 Answer

Let's try to save a variety of objects (with numpy imported as np and scipy.io imported as io):

In [597]: d={'alist':['one','two','three','four'],
   .....: 'adict':{'one':np.arange(5)},
   .....: 'strs': np.array(['one','two','three','four']),
   .....: 'objs': np.array(['one','two','three','four'],dtype=object)}

In [598]: d
Out[598]: 
{'alist': ['one', 'two', 'three', 'four'],
 'adict': {'one': array([0, 1, 2, 3, 4])},
 'objs': array(['one', 'two', 'three', 'four'], dtype=object),
 'strs': array(['one', 'two', 'three', 'four'], 
       dtype='<U5')}
In [599]: io.savemat('test.mat',d)

In [600]: dd=io.loadmat('test.mat')
In [601]: dd
Out[601]: 
{'adict': array([[([[0, 1, 2, 3, 4]],)]], 
       dtype=[('one', 'O')]),
 'strs': array(['one  ', 'two  ', 'three', 'four '], 
       dtype='<U5'),
 'alist': array(['one  ', 'two  ', 'three', 'four '], 
       dtype='<U5'),
 '__header__': b'MATLAB 5.0....',
 '__version__': '1.0',
 'objs': array([[array(['one'], 
       dtype='<U3'),
         array(['two'], 
       dtype='<U3'),
         array(['three'], 
       dtype='<U5'),
         array(['four'], 
       dtype='<U4')]], dtype=object),
 '__globals__': []}

This is with scipy version 0.14.1; not a particularly new one, but I'm not aware of recent changes to this io code.

And in Octave I get:

octave:14> data = load('test.mat')
data =

  scalar structure containing the fields:

    alist =

one  
two  
three
four 

    adict =

      scalar structure containing the fields:

        one =

          0  1  2  3  4


    objs = 
    {
      [1,1] = one
      [1,2] = two
      [1,3] = three
      [1,4] = four
    }
    strs =

one  
two  
three
four 

The list and the str array both produce (4,5) character arrays in Octave, while the dtype=object array produces a cell array of strings.
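If the goal is to avoid the padding altogether, the dtype=object route can be wrapped in a pair of helpers. This is only a sketch: save_strings_as_cells and load_cell_strings are hypothetical names made up for illustration, the file name is arbitrary, and the unpacking step assumes loadmat returns each cell as a small string array, as it does in the session above.

```python
import os
import tempfile

import numpy as np
from scipy import io


def save_strings_as_cells(path, name, strings):
    """Save a sequence of strings under `name` as a MATLAB cell array."""
    # dtype=object makes savemat write a cell array instead of a char matrix.
    cells = np.array(list(strings), dtype=object)
    io.savemat(path, {name: cells})


def load_cell_strings(path, name):
    """Load a cell array of strings back into a plain Python list."""
    loaded = io.loadmat(path)[name]
    # Each cell is itself a small string array; join whatever pieces it holds.
    return ["".join(cell.ravel().tolist()) for cell in loaded.ravel()]


genes = ["44M2.3", "A0A087WSV2", "tert-rmrp_human"]
with tempfile.TemporaryDirectory() as tmp:
    mat_path = os.path.join(tmp, "cells_demo.mat")
    save_strings_as_cells(mat_path, "Genes", genes)
    restored = load_cell_strings(mat_path, "Genes")

print(restored)
```

The loaded value is a 2-D object array rather than a flat string array, so some unpacking like the list comprehension above is needed before using it as ordinary strings.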

In both d and dd, the strs array is U5 and takes up 80 bytes (4 words*5 char/word *4 bytes/char), but in dd, the strings have been padded with blanks.

In [617]: d['strs'][0]
Out[617]: 'one'
In [618]: dd['strs'][0]
Out[618]: 'one  '
In [619]: d['strs'][0].tobytes()
Out[619]: b'o\x00\x00\x00n\x00\x00\x00e\x00\x00\x00'
In [620]: dd['strs'][0].tobytes()
Out[620]: b'o\x00\x00\x00n\x00\x00\x00e\x00\x00\x00 \x00\x00\x00 \x00\x00\x00'

Arrays like d['strs'] don't display padding because NumPy fills the unused part of each fixed-width slot with null characters and drops trailing nulls when it returns an element. The blanks that savemat adds are real characters, so they survive. Note this is with Py3, where the default string is unicode. I don't know if the Py2 byte strings behave differently (except that they take up 1 byte/char).
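The difference between null padding and blank padding can be checked directly with NumPy alone. A minimal sketch; the variable names are made up for illustration, and the dtype is pinned to little-endian '<U5' so the byte layout matches the output shown above:

```python
import numpy as np

# Fixed-width unicode array: every element occupies 5 chars * 4 bytes.
strs = np.array(["one", "two", "three", "four"], dtype="<U5")

# Raw bytes of the first 20-byte slot, taken from the whole array buffer.
raw = strs.tobytes()
first_slot = raw[: strs.itemsize]

# 'one' is followed by two null characters (8 zero bytes) inside the slot.
print(first_slot)

# Reading the element back drops the trailing nulls.
print(repr(strs[0]), len(strs[0]))

# Blanks are ordinary characters, so they are kept.
blank_padded = np.array(["one  "], dtype="<U5")
print(repr(blank_padded[0]), len(blank_padded[0]))
```

So the array in d is padded too, just with nulls that never show up when an element is read, while the array in dd holds genuine space characters.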

So yes, io.savemat does change the string array (and lists) by adding blanks to the full dtype width. The purpose is to create a MATLAB style character matrix.

@zeeMonkeez's link covers this, including a way of converting the character matrix to cell:

octave:25> cellstr(data.strs)
ans = 
{
  [1,1] = one
  [2,1] = two
  [3,1] = three
  [4,1] = four
}
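On the Python side, a rough counterpart to cellstr is to strip the trailing blanks after loading. The sketch below uses a hand-written stand-in for the padded array that loadmat returned earlier, rather than reading a real file:

```python
import numpy as np

# Stand-in for a padded array as returned by loadmat (sample values only).
padded = np.array(["one  ", "two  ", "three", "four "])

# Remove trailing whitespace from every element.
stripped = np.char.rstrip(padded)

print(stripped.tolist())
```

One caveat: this also removes trailing whitespace that was part of an original string, so it only restores the data exactly when the saved strings had no trailing blanks of their own.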

Python to MATLAB: exporting list of strings using scipy.io

hpaulj