Associate (.assoc) binary file structure / probing unknown file structure

Question

I have a file that has a .assoc ending, and is apparently a 'binary associated file', though I can find no information on files of that sort online. It is read by fortran and idl, is 49Mb, and I'm trying to read it into python. This is probably too much of an open question, but could anyone suggest a way to probe the file's structure to get some idea of how I could read it?

I know that the file is a map, presumably 2D, of elevations on Mars. And it has a short ascii header:

        7200         3600 MOLA .05 dg/px topo 5/2002

---------------------------------------------------------

header length        14400 bytes

map X size                7200

map Y size                3600

no-data value            30303

maximum value            21197

minimum value            -8204

The map is stored as an INT array with X as

longitude and Y as latitude. The map is assumed to be

global in coverage.

---------------------------------------------------------

Sorry for the ill-formed question, but general advice on how to probe an unknown file type would be much appreciated. Or if you know of this file type then that would be even better!

Here is a snippet of the idl code that reads the file:

ELMAP='elevmap.assoc'
OPENR, ELUN, ELMAP, /GET_LUN
B = ASSOC(ELUN,BYTARR(100))              ; assoc header
HEADER = STRING(B[0])                    ; read the header
NLON = 0                                 ; 'fix' no. of longitudes
NLAT = 0                                 ; 'fix' no. of latitudes
READS,HEADER,NLON,NLAT                   ;  read no. of lons/lats
EXG = NLON/360                           ; longitude scale (pix/deg)
EYG = NLAT/180                           ; latitude scale (pix/deg)
EMAP = ASSOC(ELUN,INTARR(1),14400)

A hexdump of the first 30 bytes (I did "od -H -N 30 elevmap.assoc") looks like this:

0000000          20202020        20202020        30303237        20202020
0000020          20202020        30363320        4f4d2030        0000414c
0000036

A hexdump of the first 30 bytes after the header ("od -H -j 14400 -N 30 elevmap.assoc", please let me know if I'm misunderstanding this) looks like this:

0034100          0e970e93        0ea50e9d        0ea50ea5        0ea50ea5
0034120          0ea50ea5        0ea40ea4        0ea20ea3        00000ea2
0034136

Have you tried looking at it with a hex editor? Do you the IDL or Fortran source code that reads it? — M. S. B., Jul 10 '12 at 01:16
It's 49Mb. I haven't got the Fortran source code that reads it, I've added to my original question the idl code that reads it. I hadn't tried a hex editor, I'll look in to that. Thanks. — EddyTheB, Jul 10 '12 at 13:56

Hristo Iliev · Accepted Answer · 2012-07-10T17:01:01.290

A hexdump of the first 30 bytes (I did "od -H -N 30 elevmap.assoc") looks like this:

0000000          20202020        20202020        30303237        20202020
0000020          20202020        30363320        4f4d2030        0000414c
0000036

These are obviously stored in little endian so you should reverse each byte sequence. Here, I have translated that to ASCII for you:

0000000          20 20 20 20 20 20 20 20 37 32 30 30 20 20 20 20
                  _  _  _  _  _  _  _  _  7  2  0  0  _  _  _  _
0000020          20 20 20 20 20 33 36 30 30 20 4d 4f 4c 41
                  _  _  _  _  _  3  6  0  0  _  M  O  L  A

A hexdump of the first 30 bytes after the header ("od -H -j 14400 -N 30 elevmap.assoc", please let me know if I'm misunderstanding this) looks like this:

0034100          0e970e93        0ea50e9d        0ea50ea5        0ea50ea5
0034120          0ea50ea5        0ea40ea4        0ea20ea3        00000ea2

Header says that there are 7200 x 3600 INT values. How big is INT? If INT is as big as int on most Unix systems (4 bytes) then the total size of the file should be 7200 x 3600 x 4 or almost 99 MiB. You say it is 49 MiB hence either there is a compression being used (unlikely) or INT is 16-bit (more likely). Strong evidence for the latter is what you get from this second od - 0e970e93 looks very much like 0e97 and 0e93 being wrongly concatenated in a 32-bit integer. One would expect from a topographical map that neighbouring values should not change abruptly (except at the borders of some deep vertical trench or a steep mountain wall). This is also consistent with the fact that values are within the range of short int: [-32768, 32767].

With this knowledge in place the above dump should be read as:

0034100          0e93  0e97  0e9d  0ea5  0ea5  0ea5  0ea5  0ea5
                +3731 +3735 +3741 +3749 +3749 +3749 +3749 +3749
0034120          0ea5  0ea5  0ea4  0ea4  0ea3  0ea2  0ea2
                +3749 +3749 +3748 +3748 +3747 +3746 +3746

Now you only have to figure out if X-major or Y-major data storage was used. In my experience most data processing tools follow the Fortran column-major order and if data is DATA(X,Y) then X should be the leading dimension, i.e. consequitive data values should be DATA(1,Y), DATA(2,Y), DATA(3,Y), ..., DATA(1,Y+1), DATA(2,Y+1), etc. You can always plot the data using PIL or any other image manipulation package for Python and see if you get something akin a topographic map or a messy whatever.

If unpacking the data with struct.unpack() as suggested by mgilson, you should be using the h format for signed short integer values:

data = struct.unpack('%dh' % (7200*3600), f.read(7200*3600*2))

For completeness I'll post my own answer with the code that read the file for me. — EddyTheB, Jul 10 '12 at 18:11

score 2 · Answer 2 · edited May 23 '17 at 12:12

2

Here are some questions that discusses how Fortran writes unformatted files: Fortran unformatted file format or Reading a direct access fortran unformatted file in Python. To know where the extra bytes are you have to know what read statements were used in Fortran, so that you know the record structure. The extra bytes can be bypassed in recent Fortran by using Stream IO. With a hex editor one can reverse engineer these aspects.

Endianess normally can't be specified in Fortran. Another aspect that you could deduce with a hex editor. Gfortran has an extension to specify endianess.

edited May 23 '17 at 12:12

Community

1
1

answered Jul 10 '12 at 02:15

M. S. B.

28,968
2
46
73

You can also convert endianess yourself in Forfran. Of course, it's not very fast. – Vladimir F Героям слава Jul 10 '12 at 06:45

score 2 · Answer 3 · answered Jul 10 '12 at 18:22

Thank you all for your help. Particularly Hristo Iliev for a great explanation of hexdump output and mgilson for a useful code template.

For completeness, on the off chance that anyone else ever stumbles on this post trying to read an assoc file, here is the python code that worked for me.

import struct
import numpy as np
import matplotlib.pylab as pl

with open('elevmap.assoc','rb') as f:
    f.read(14400)
    data=struct.unpack('%dh' % (7200*3600), f.read(7200*3600*2))

# Now turn it into a numpy array
data = np.array(data).reshape(3600,7200)

pl.figure()
pl.imshow(data)
pl.show()

Which returned this pretty little map: Mars Elevation

It's Mars, South is up, but that's easy to fix. Thanks again all.

mgilson · Answer 4 · 2012-07-10T18:43:40.750

1

That isn't any file format that I've ever seen, however, based on the information in the header, you might be able to do something like:

import numpy as np
import struct

with open('datafile.assoc','rb') as f:
    nx,ny=f.read(14400).split()[:2]  #here I split the header and only take the array indices.  You could get more fancy with your parsing if you wanted.
    data=struct.upack('%dh'%(nx*ny),f.read(nx*ny*2))

#now turn it into a numpy array:
data = np.array(data).reshape(ny,nx)  #assume "x" is the fast index
data[data==30303] = np.nan
#some checks
print (np.nanmax(data)) # 21197
print (np.nanmin(data)) # -8204

If that doesn't work and you have the fortran code which generated the file (or fortran code which can read the file), that would be helpful as well.

edited Jul 10 '12 at 18:43

answered Jul 10 '12 at 00:17

mgilson

300,191
65
633
696

That's great thanks, as per Hristo Iliev's suggestion I needed to change to %dh and 7200*3600*2. – EddyTheB Jul 10 '12 at 18:25
@EddyTheB -- Yeah, I was going to change that when I saw your file was only ~50Mb, but I didn't have time to look up the code for 2-byte ints. I'm glad you figured it out. – mgilson Jul 10 '12 at 18:37
@EddyTheB -- It also looks like you could parse the field size information from the first line of the file: `nx,ny=open('file','r').readline().split()[:2]` (Of course, you should probably break this into multiple statements and close the file before continuing.) I've updated the code above to reflect that. – mgilson Jul 10 '12 at 18:39

score 1 · Answer 5 · answered Jul 10 '12 at 00:23

1

According to the header, the data is saved as a 7200 x 3600 array of INT.

The minimum value is negative, so it is presumably either a 16- or 32-bit signed int. (size of remaining file) / (7200 * 3600) should be the size of one int in bytes.

Remaining questions:

byte endian-ness (ie 623 could be saved as 00 00 02 6f (big-endian) or 6f 02 00 00 (little-endian)
row-by-row or column-by-column - try both and see which looks right

If you could give us the overall file size and a hex dump of the first 30-odd bytes of data, it would really help.

answered Jul 10 '12 at 00:23

Hugh Bothwell

55,315
8
84
99

If it was written in fortran, there might be record information to deal with as well...It's hard to say without seeing the fortran `open` statement. (We could guess based on filesize...). And, AFAIK, in fortran, there's no way to specify endianness of the file, so that is still a question. X is probably the "fast index" since it comes first in the meta-data and the first index in fortran is fastest. – mgilson Jul 10 '12 at 00:46

Associate (.assoc) binary file structure / probing unknown file structure

5 Answers5