Python reading unformatted direct access Fortran 90 gives incorrect output

Question

Here is how the data is written (its a 2-D matrix of floats. I am not sure the size).

open(unit=51,file='rmsd/'//nn_output,form='unformatted',access='direct',status='replace',&
     recl=Npoints*sizeofreal)

!a bunch of code omitted 

    write(51,rec=idx-nstart+1) (real(dist(jdx)),jdx=1,Npoints)

Here is how I am trying to read the file, inspired from the answer to these similar questions: 1, 2.

   f = open(inputfilename,'rb')
   field = np.fromfile(f,dtype='float64')

But the result is not correct. The diagonal of the matrix should be 0 (or very close) as it is a similarity matrix. I tried different dtypes but I still could not get a correct result.

EDIT: here is the output I get

array([  1.17610188e+01,   2.45736970e+02,   7.79741823e+02, ...,
         9.52930627e+05,   8.93743127e+05,   7.64186127e+05])

I haven't bothered with reshaping yet, as the values should range from 0 to ~20.

EDIT 2: Here is a link to the first 100 lines of the file: https://drive.google.com/file/d/0B2Mz7CoRS5g5SmRxTUg5X19saGs/view?usp=sharing

EDIT 3: Declarations at top of file

integer,parameter :: real_kind=8
!valuables for MPI
integer :: comm, nproc, myid, ierr

integer,allocatable :: idneigh(:)
real :: tmp
real,allocatable :: traj(:,:)
real(real_kind),allocatable :: dist(:)
real(real_kind),allocatable :: xx(:,:),yy(:,:),rot(:),weight(:)
character(200) :: nn_traj, nn_output, nn_neigh
integer,parameter :: sizeofreal=4,sizeofinteger=4

How did you reshape it? Did you use the Fortran column major order? How does your incorrect result look like? — Vladimir F Героям слава, Sep 02 '15 at 20:22
you are writing reals of the default kind and reading double precision reals. Unless you are overriding the default real kind, it is single precision (`float32` in python). What is the value of `sizeofreal`? What is your python output and what is your expected output? Which compiler are you using and are there any options affecting record sizes (e.g. the default unit of recl is different in gfortran and ifort) — casey, Sep 02 '15 at 20:41
if `Npoints` is small, can you post a hexdump of the output file? If `Npoints` is large, you could post the beginning of the file and a few lines in. — casey, Sep 02 '15 at 20:48
@casey `sizeofreal` is 4 and I am using mpiifort to compile. I've posted the declarations at the top of the file and some additional info. — rohanp, Sep 02 '15 at 22:27

casey · Accepted Answer · 2015-09-06T22:17:40.137

Unformatted Fortran binary files encode record length to delimit the records. In general the record length will be written both before and after each record, though if memory serves the details of that are processor dependent (See the second half of the post if records are not delimited in this way). Looking at the file you posted, if you interpret the first 4 bytes as an integer and the remaining bytes as 32 bit floating point values you get:

0000000        881505604    7.302916e+00    8.723415e+00    6.914254e+00
0000020     9.826199e+00    7.044637e+00    8.601265e+00    6.629045e+00
0000040     6.103047e+00    9.476192e+00    9.326468e+00    6.535160e+00
0000060     8.904651e+00    4.710213e+00    6.534080e+00    1.156603e+01
0000100     1.046533e+01    9.343380e+00    8.574672e+00    7.498291e+00
0000120     1.071538e+01    7.138038e+00    5.898036e+00    6.182026e+00
0000140     7.037515e+00    6.418780e+00    6.294755e+00    8.327971e+00
0000160     6.796582e+00    7.397069e+00    6.493272e+00    1.126087e+01
0000200     6.467663e+00    7.178994e+00    7.867798e+00    5.921878e+00

If you seek past this record length field you can read the rest of the record into a python variable. You will want to specify the correct number of bytes because there will be another record length at the end of the record and that will read in as a wrong value into your array. 881505604 implies your NPoints is 220376401 (if this is not true, then see the second half of the post, this may be data and not a record length).

You can read this data in python with:

f = open('fortran_output', 'rb')
recl = np.fromfile(f, dtype='int32', count=1)
f.seek(4)
field = np.fromfile(f, dtype='float32')

print('Record length=',recl)
print(field)

This reads the record length, seeks 4 bytes forward and reads the rest of the file into a float32 array. You will want to, as mentioned before, specify a proper count= to the read to not intake the end record field.

This program outputs the following for your input file:

Record length= [881505604]
[ 7.30291557  8.72341537  6.91425419 ...,  6.4588294   6.53710747
  6.01582813]

All of the 32 bit reals in the array are between 0 and 20 as you suggest is proper for your data, so this looks like what you want.

If, however, your compiler does not encode record length in delimiting records for unformatted, direct access files, then the output is instead interpreted as 32 bit floats is:

0000000    2.583639e-07       7.3029156        8.723415        6.914254
0000020        9.826199        7.044637        8.601265        6.629045
0000040        6.103047       9.4761915        9.326468       6.5351596
0000060        8.904651        4.710213         6.53408       11.566033
0000100       10.465328         9.34338        8.574672        7.498291
0000120       10.715377        7.138038       5.8980355        6.182026
0000140       7.0375147         6.41878       6.2947555       8.3279705
0000160       6.7965817       7.3970685       6.4932723       11.260868
0000200        6.467663        7.178994        7.867798       5.9218783
0000220        6.710998         5.71757       6.1372333        5.809089

where the first value is much smaller toward zero, which may be appropriate given you say the matrix diagonal should be 0.

To read this you can do the read in python just as you were attempting, though it would be prudent to use count= to only read in the proper record length if there is more than one record in the file.

Using the python code:

f = open('fortran_output', 'rb')
field = np.fromfile(f, dtype='float32')

print(field)

produces the output

[  2.58363912e-07   7.30291557e+00   8.72341537e+00 ...,   6.45882940e+00
   6.53710747e+00   6.01582813e+00]

which matches the file output interpreted as 32 bit floats.

The particulars of the writing are generally with no record delimiter for this kind of output when tested with various gfortran and ifort versions, but may be like the first half of the post or something different for other compilers.

I also want to re-iterate the caveat that this solution is not with 100% confidence since we do not have the data that you wrote into this file to verify. You do have that information though, and you should verify the data produced is proper. It is also worth noting when compiling with default flags, ifort record lengths are in 4 byte words, while gfortran is in 1 byte units. This won't cause a problem in output but if not compensated for in ifort, your files will be 4 times bigger than needed. You can get record lengths in byte units by using -assume byterecl with ifort.

Yeah, the format is very processor dependent. AFAIK gfortran does not delimit records in direct access files, they look like stream files. — Vladimir F Героям слава, Sep 03 '15 at 07:36
@VladimirF precisely the reason I avoid the headache and use hdf5 for all my IO. — casey, Sep 03 '15 at 13:47
intel also does not use headers with direct access. Actually Ive never seen otherwise over the years, but I don't know what the standard actually says on the matter. Another thing to watch for, the `recl=` unit may not be, ( or perhaps usually is not ) single bytes but 4 byte units. — agentp, Sep 03 '15 at 18:40
@agentp Its possible that first 32 bits isn't a header, but as a 32 bit real it is on the order of 1e-7 and doesn't match his expected data. Of course, without knowing his `Npoints` or what his data actually is there is no 100% way to tell for sure. — casey, Sep 03 '15 at 18:49

Python reading unformatted direct access Fortran 90 gives incorrect output

1 Answers1

Linked