
I am having major trouble with struct.unpack in Python. I have a binary file with a predetermined format that can be written either in MATLAB or in Python.

I can write binary data to a file in Python and read the data back with no issues. I can also write the same data to a binary file from MATLAB and read it back in MATLAB with no problem.

My problem comes when I either write the data from MATLAB and try to read it back in Python, or when I write the data in Python and try to read it back in MATLAB.

For simplicity, let's say I'm writing two integers to a binary file (big-endian). Each integer is 4 bytes. The first is any integer that fits in 4 bytes, and the second must be equal to either 1, 2, or 3.

First, here is how I write my data in MATLAB:

fid=fopen('hello_matlab.test','wb');
first_data=4+4;
second_data=1;

fwrite(fid,first_data,'int');
fwrite(fid,second_data,'int');

fclose(fid);

And here is how I read that back in MATLAB:

fid=fopen('hello_matlab.test','rb');
first_data=fread(fid,1,'int');
second_data=fread(fid,1,'int');

fprintf('first data: %d\n', first_data);
fprintf('second data: %d\n', second_data);

fclose(fid);

    >> first data: 8
    >> second data: 1

Now, here is how I write the data in Python:

import struct

fid=open('hello_python.test','wb')
first_data=4+4
second_data=1

fid.write(struct.pack('>i',first_data))
fid.write(struct.pack('>i',second_data))

fid.close()

And here is how I read that data back in Python. Note that the commented-out portion worked when reading files written in Python. I originally thought something strange was happening with the way struct.calcsize('>i') was being calculated, so I removed it and instead used a hard-coded constant, INTEGER_SIZE, set to the number of bytes I knew MATLAB had used when encoding it:

import struct

INTEGER_SIZE=4

fid=open('hello_python.test','rb')

### FIRST WAY I ORIGINALLY READ THE DATA ###
# This works, but I figured I would try hard coding the size
# so the uncommented version is what I am currently using.
#
# first_data=struct.unpack('>i',fid.read(struct.calcsize('>i')))[0]
# second_data=struct.unpack('>i',fid.read(struct.calcsize('>i')))[0]

### HOW I READ DATA CURRENTLY ###
first_data=struct.unpack('>i',fid.read(INTEGER_SIZE))[0]
second_data=struct.unpack('>i',fid.read(INTEGER_SIZE))[0]

print "first data: '%d'" % first_data
print "second data: '%d'" % second_data

fid.close()

    >> first data: 8
    >> second data: 1

Now, let's say I want to read hello_python.test in MATLAB. With my current MATLAB code, here is the new output:

    >> first data: 419430400
    >> second data: 16777216

That is strange, so I did the reverse: I looked at what happens when I read hello_matlab.test with my current Python code. Here is the new output:

    >> first data: 419430400
    >> second data: 16777216
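A quick sanity check suggests the bytes are being read in reverse order: reinterpreting a big-endian 1 with the opposite byte order gives exactly the second number I'm seeing:

```python
import struct

# Pack the integer 1 big-endian, then reinterpret the same 4 bytes
# as a little-endian integer: the byte order is reversed.
data = struct.pack('>i', 1)             # b'\x00\x00\x00\x01'
swapped = struct.unpack('<i', data)[0]  # 0x01000000
print(swapped)                          # 16777216
```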

So, something strange is happening, but I don't know what it is. Also note that although this is part of a larger project, I extracted just these parts of my code into a new project and reproduced the results above. I'm really confused about how to make this file portable :( Any help would be appreciated.

Alex
  • I don't see anything in the MATLAB code that would indicate you're writing the values in big-endian format; I suspect they're being written in little-endian format, so when reading using the Python code, you'll want ` – Mark Dickinson Aug 21 '16 at 09:47
  • Also, after writing the entire MATLAB file, what's the result of opening that file up in Python and doing a simple `fid.read()` to read the entire contents? – Mark Dickinson Aug 21 '16 at 09:48
  • Does this help? http://stackoverflow.com/questions/874461/read-mat-files-in-python – cdarke Aug 21 '16 at 10:07

2 Answers


You may be interested in the pandas HDF5 store:

In Python:

In [418]: df_for_r = pd.DataFrame({"first": np.random.rand(100),
   .....:                          "second": np.random.rand(100),
   .....:                          "class": np.random.randint(0, 2, (100,))},
   .....:                          index=range(100))
   .....: 

In [419]: df_for_r.head()
Out[419]: 
   class     first    second
0      0  0.417022  0.326645
1      0  0.720324  0.527058
2      1  0.000114  0.885942
3      1  0.302333  0.357270
4      1  0.146756  0.908535

In [420]: store_export = HDFStore('export.h5')

In [421]: store_export.append('df_for_r', df_for_r)

In [422]: store_export
Out[422]: 
<class 'pandas.io.pytables.HDFStore'>
File path: export.h5
/df_for_r            frame_table  (typ->appendable,nrows->100,ncols->3,indexers->[index])

In MATLAB:

data = h5read('export.h5','/df_for_r');

But I'm not sure if it works; I wrote this completely in the browser...

yourstruly
  • We originally used netcdf and wanted to transition to our own binary file format so that it would be relatively language independent :( Unfortunately I have to stick to that set of guidelines. My fault, I should have put that in the original description. – Alex Aug 21 '16 at 09:30
  • How about making it with c++? Like put data-to-be-writed in c and then save it, read it with c, put somewhere elese? C is universal I think... I was not playing with that topic, but I would start from c for that :) Easiest would be to use plain csv files... Most efficient in storage cap would be with pandas hdf5 store with compression... Most efficient in speed I dknow xD... – yourstruly Aug 21 '16 at 09:47
  • Yeah.. I think I might just save myself the trouble and start writing it in C. Thanks for everything! – Alex Aug 21 '16 at 09:53
  • HDF5 is pretty language-independent. Most major numerical languages support it, and you are going to be writing C code anyway then you could link to the HDF5 C, C++, or Fortran library in your language of choosing. – TheBlackCat Aug 21 '16 at 15:10

The issue is endianness, the order of the bytes that make up a multi-byte number. You must be on an x86 or x86-64 computer (since those are the only architectures MATLAB supports), and those are little-endian. However, the Python `>i` format tells struct to use big-endian byte order. So you are using opposite byte orders, which makes the two languages read completely different numbers.

If you only ever plan on running the Python code on an x86 or x86-64 computer, or you only care about sending data between MATLAB and Python on the same computer, then you can leave off the byte-order mark completely and use the native byte order (`i` instead of `>i`). If the Python code might run on a big-endian system such as PowerPC, you should explicitly specify little-endianness (`<i`).
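As a sketch, assuming hello_matlab.test was written by the MATLAB code in the question on a little-endian machine, reading it back with `<i` instead of `>i` gives the expected values:

```python
import struct

# Simulate the file MATLAB writes on a little-endian machine:
# two 4-byte little-endian signed integers.
with open('hello_matlab.test', 'wb') as fid:
    fid.write(struct.pack('<i', 8))
    fid.write(struct.pack('<i', 1))

# Read it back with an explicit little-endian format instead of '>i'.
with open('hello_matlab.test', 'rb') as fid:
    first_data = struct.unpack('<i', fid.read(4))[0]
    second_data = struct.unpack('<i', fid.read(4))[0]

print(first_data)   # 8
print(second_data)  # 1
```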

For this example, that appears to be the only issue. I would also point out that if you are reading and writing whole arrays/matrices of data at a time, numpy.fromfile will be much faster and easier.
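As a sketch of that approach (assuming NumPy is available), a byte-order-aware dtype handles both the element width and the endianness in one place:

```python
import numpy as np

# '>i4' = big-endian 4-byte signed integer; '<i4' would be little-endian.
values = np.array([8, 1], dtype='>i4')
values.tofile('hello_numpy.test')

# Read the whole array back in one call.
read_back = np.fromfile('hello_numpy.test', dtype='>i4')
print(read_back)  # [8 1]
```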

TheBlackCat
  • If using numpy, just don't use np.loadtxt. By speed it would be np.fromfile > np.load > pd.read_csv >> np.loadtxt (based on http://softwarerecs.stackexchange.com/questions/7463/fastest-python-library-to-read-a-csv-file). – yourstruly Aug 21 '16 at 17:02
  • The issue did appear to be related to endianness. I originally thought it would be ok since MATLAB claims that specifying `fwrite(...,'int')` means that `int = 4 bytes`. However, after adding the `ieee-be` formatter, it did seem to clear the problem up. – Alex Aug 22 '16 at 03:47
  • @Alex: why do you want it big-endian? Since pretty much any computer you are going to use these days is either little-endian our dual-endian, you are just going to be adding unnecessary overhead by making your data big-endian. – TheBlackCat Aug 22 '16 at 11:18
  • The original files that my binary format was replacing was netcdf written in classic format, which uses [big endian](http://www.unidata.ucar.edu/support/help/MailArchives/netcdf/msg11257.html). Since I figured I would have to translate those files over eventually, I thought I would make it easier on myself and read/write in big-endian. – Alex Aug 23 '16 at 08:08