4

I'm having trouble reading an unformatted F77 binary file in Python. I've tried the SciPy.io.FortraFile method and the NumPy.fromfile method, both to no avail. I have also read the file in IDL, which works, so I have a benchmark for what the data should look like. I'm hoping that someone can point out a silly mistake on my part -- there's nothing better than having an idiot moment and then washing your hands of it...

The data, bcube1, have dimensions 101x101x101x3, and is r*8 type. There are 3090903 entries in total. They are written using the following statement (not my code, copied from source).

open (unit=21, file=bendnm, status='new'
.     ,form='unformatted')
write (21) bcube1
close (unit=21)

I can successfully read it in IDL using the following (also not my code, copied from colleague):

bcube=dblarr(101,101,101,3)
openr,lun,'bcube.0000000',/get_lun,/f77_unformatted,/swap_if_little_endian
readu,lun,bcube
free_lun,lun

The returned data (bcube) is double precision, with dimensions 101x101x101x3, so the header information for the file is aware of its dimensions (not flattend).

Now I try to get the same effect using Python, but no luck. I've tried the following methods.

In [30]: f = scipy.io.FortranFile('bcube.0000000', header_dtype='uint32')
In [31]: b = f.read_record(dtype='float64')

which returns the error Size obtained (3092150529) is not a multiple of the dtypes given (8). Changing the dtype changes the size obtained but it remains indivisible by 8.

Alternately, using fromfile results in no errors but returns one more value that is in the array (a footer perhaps?) and the individual array values are wildly wrong (should all be of order unity).

In [38]: f = np.fromfile('bcube.0000000')
In [39]: f.shape
Out[39]: (3090904,)
In [42]: f
Out[42]: array([ -3.09179121e-030,   4.97284231e-020,  -1.06514594e+299, ...,
         8.97359707e-029,   6.79921640e-316,  -1.79102266e-037])

I've tried using byteswap to see if this makes the floating point values more reasonable but it does not.

It seems to me that the np.fromfile method is very close to working but there must be something wrong with the way it's reading the header information. Can anyone suggest how I can figure out what should be in the header file that allows IDL to know about the array dimensions and datatype? Is there a way to pass header information to fromfile so that it knows how to treat the leading entry?

NoMansEyes
  • 41
  • 1
  • 5
  • 1
    Did you have a look at e.g. https://stackoverflow.com/questions/37534220/python-read-fortran-binary-file ( found by google; python reading Fortran binary file) – albert Dec 05 '18 at 19:05
  • Please use tag [tag:fortran] for all Fortran questions. Your question is not version specific anyway. – Vladimir F Героям слава Dec 05 '18 at 21:11
  • @VladimirF I'm sorry if my question wasn't clear. Perhaps I can rephrase. Why does `np.fromfile(fname)` return more values than are in the array? In may case, there should be 3090903 entries, but the result has 3090904 entries. And why are the values that it returns not equal to the values in the source array? – NoMansEyes Dec 05 '18 at 22:27
  • @albert Yes I've looked at that post. It addresses having the wrong datatype for the content of the array. However, I know that the data in the array is r8, so I know that the python datatype should be float64. – NoMansEyes Dec 05 '18 at 22:29
  • No worries. I was only referring to the tag fortran vs tag fortran77, not to the clarity of your question. – Vladimir F Героям слава Dec 05 '18 at 22:39
  • I don't know about Fortran. But for `np.fromfile`, it reads raw data without any header or footer. If you file has header (and hopefully you know how long is it), I believe you can skip the header by passing a `open` and `seek`ed file to `np.fromfile`. If your file has footer, use the `count` parameter of `np.fromfile`. If you are not sure how long is header/footer. Do a experiment with `0.` and observe the file using hex editor – ZisIsNotZis Dec 06 '18 at 05:27

2 Answers2

3

I played a bit around with it, and I think I have an idea.

How Fortran stores unformatted data is not standardized, so you have to play a bit around with it, but you need three pieces of information:

  1. The Format of the data. You suggest that is 64-bit reals, or 'f8' in python.
  2. The type of the header. That is an unsigned integer, but you need the length in bytes. If unsure, try 4.

    The header usually stores the length of the record in bytes, and is repeated at the end.

    Then again, it is not standardized, so no guarantees.

  3. The endianness, little or big.

    Technically for both header and values, but I assume they're the same.

    Python defaults to little endian, so if that were the the correct setting for your data, I think you would have already solved it.

When you open the file with scipy.io.FortranFile, you need to give the data type of the header. So if the data is stored big_endian, and you have a 4-byte unsigned integer header, you need this:

from scipy.io import FortranFile
ff = FortranFile('data.dat', 'r', '>u4')

When you read the data, you need the data type of the values. Again, assuming big_endian, you want type >f8:

vals = ff.read_reals('>f8')

Look here for a description of the syntax of the data type.

If you have control over the program that writes the data, I strongly suggest you write them into data streams, which can be more easily read by Python.

chw21
  • 7,970
  • 1
  • 16
  • 31
  • 1
    That's fixed it! Thank you so much! For any future readers, the solution is to use the datatypes as defined by The Array Interface. TLDR; '>u4' and '>f8' *imply* bigendian 'uint32' and 'float64' but are specific to the C/F API. – NoMansEyes Dec 06 '18 at 10:35
0

Fortran has record demarcations which are poorly documented, even in binary files.

So every write to an unformatted file:

integer*4 Test1
real*4 Matrix(3,3)

open(78,format='unformatted')
write(78) Test1
write(78) Matrix
close(78)

Should ultimately be padded by an np.int32 values. (I've seen references that this tells you the record length, but haven't verified persconally.)

The above could be read in Python via numpy as:

input_file = open(file_location,'rb')
datum = np.dtype([('P1',np.int32),('Test1',np.int32),('P2',np.int32),('P3',mp.int32),('MatrixT',(np.float32,(3,3))),('P4',np.int32)])
data = np.fromfile(input_file,datum)

Which should fully populate the data array with the individual data sets of the format above. Do note that numpy expects data to be packed in C format (row major) while Fortran format data is column major. For square matrix shapes like that above, this means getting the data out of the matrix requires a transpose as well, before using. For non square matrices, you will need to reshape and transpose:

Matrix = np.transpose(data[0]['MatrixT']

Transposing your 4-D data structure is going to need to be done carefully. You might look into SciPy for automated ways to do so; the SciPy package seems to have Fortran related utilities which I have not fully explored.

Mouse
  • 11
  • 1