How to read a binary file with a known header and file format for data analysis?

Question

I'm currently doing some basic analysis and trying to build tools to automate some of the more quantitative parts of my job. One of these tasks is analyzing data from local instruments and using that data to draw quantitative conclusions. The end goal is to calculate percent data coverage over a given region (what percent of values in area 'x' exceed value 'y'?). However, there are problems.

First, the data we are looking at is in binary. While the programmer's guides for the data document some of the data structure, they are very sparse in how to actually utilize the data for analysis outside of their proprietary programs.

Second, I am new to Python. While I tried programming tasks in Python years ago, I did not end up making anything useful; I am more adept at shell scripting, can work with HTML/JavaScript/PHP, and can manage a program written in Fortran. I'm trying to learn Python to diversify.

What I know about the data in question: The binary file contains a 640-character header made up of three parts. Each part is a mixture of characters; unsigned and signed 8-, 16-, and 32-bit integers; and 16- and 32-bit binary angles. After the header, the file holds a Cartesian grid of data as 'pixels' in an 'image'. Each 'pixel' within the 'image' is a one-byte unsigned character with a value between 0 and 255. The 'image' is a 2-D grid of 'x by y', with the next 'image' starting after a fixed number of bytes (in this data set, the images are 720 by 720 'pixels', so a new 'image' starts every 720^2 bytes).

Right now, my goal is just to read the file into a python program and separate the various "images" for inspection. The initialized data/format are below:

testFile = 'C:/path/to/file/binaryFile'
headerFormat = '640c'
nBytesData = 720 * 720
# Below is commented out
# inputFile = open(testFile, 'rb')

I have been able to read the file in as a binary file, but I have no clue how to inspect it. First instinct was to try and put it in a numpy array, but additional research suggested using the struct module and struct.unpack to break apart the data. From what I've read, the following block should unpack each 'image' correctly after the initial header, even if it's not the most efficient method:

header_size = struct.calcsize(headerFormat)
testUnpacked = []
with open(testFile, 'rb') as testData:
    headerOut = testData.read(header_size)
    print("header is: ", headerOut)
    while True:
        testContent = testData.read()
        if not testContent: break
        testArray = struct.unpack(testContent, nBytesData)
        testUnpacked.append(testArray)

The problem is that I do not know how to set up the code to unpack or skip the header of the binary file. I do not think the headerFormat = '640c' line, plus the next couple of commands that try to format its output, are correct. I was able to output a line that the program, run in PyCharm, interpreted as the "header", and below is a sample of the output starting from the first 'print': b'\x1b\x00\x08\x00\x80\xd4\x0f\x00\x00\x00\x00\x00\x1a\x00\x06\x00@\x01\x00\x00\x00\x00\x00\x00\x03\x00\x02\x00\x00\x00\x00\x00}\t\x0

After that, I got an error stating that there is an embedded null character preventing the data from saving to the designated array.

Other questions I referenced to try and figure out how to read the data:

Reading a binary file with python
Reading a binary file into a struct
Fastest way to read a binary file with a defined format?

Main questions are as follows:

How do I tell the program to read the binary file header and then start reading the file according to the 720^2 arrays?
How do I tell the program to save the header in a format I can understand?
How do I figure out what is causing the struct.error message?

bb1 · Accepted Answer · 2022-02-13T04:50:44.923

Based on this description it is difficult to say how one could read the header, since this will depend on its specific structure. It should be possible though to read the rest of the file.
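If the programmer's guide gives the byte offsets and types of the header fields, the struct module can decode them one group at a time. The sketch below is only an illustration: the field names and the little-endian '<hhi' layout are assumptions, not taken from the guide, and the 12 sample bytes are copied from the start of the header printed in the question. The binary-angle helper assumes the usual convention of 65536 steps per full turn. Note that struct.unpack takes the format string first and the bytes second; passing the file contents as the first argument, as in the question's loop, is the likely source of the "embedded null character" error.

```python
import struct

# First 12 bytes of the header as printed in the question.
sample = b'\x1b\x00\x08\x00\x80\xd4\x0f\x00\x00\x00\x00\x00'

# Hypothetical layout: two signed 16-bit integers followed by one signed
# 32-bit integer, little-endian. Replace with the layout from the guide.
FIELDS_FORMAT = '<hhi'
FIELD_NAMES = ('field_a', 'field_b', 'field_c')  # placeholder names


def decode_fields(header, fmt=FIELDS_FORMAT, names=FIELD_NAMES, offset=0):
    """Unpack one group of header fields into a dict (format first, bytes second)."""
    values = struct.unpack_from(fmt, header, offset)
    return dict(zip(names, values))


def binary_angle16_to_degrees(raw):
    """Convert an unsigned 16-bit binary angle to degrees (assumes 65536 steps per turn)."""
    return raw * 360.0 / 65536.0


fields = decode_fields(sample)
print(fields)
```

With this assumed layout the third value comes out as 1037440, which happens to equal 640 + 2 * 720**2; whether that is meaningful depends on the real header definition.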

Start by reading the file as a byte array:

with open(testFile, 'rb') as testData:
    data = testData.read()

len(data) will give the number of bytes. Assuming that the header consists of fewer than 720^2 bytes, and that the rest of the bytes are subdivided into images of 720^2 bytes each, the remainder from the division of len(data) by 720^2 will give the length of the header:

len_header = len(data) % 720**2
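Since the question states that the header is 640 bytes long, that value can also be used directly, with the remainder serving as a sanity check. A minimal sketch, assuming a hypothetical helper named split_header and that every image occupies exactly 720 * 720 bytes:

```python
HEADER_SIZE = 640          # header length stated in the question
IMAGE_BYTES = 720 * 720    # one byte per pixel


def split_header(data, header_size=HEADER_SIZE, image_bytes=IMAGE_BYTES):
    """Return (header, body, n_images); raise if the body is not whole images."""
    body_len = len(data) - header_size
    if body_len < 0 or body_len % image_bytes != 0:
        raise ValueError(
            f"{len(data)} bytes does not match a {header_size}-byte header "
            f"plus whole images of {image_bytes} bytes"
        )
    return data[:header_size], data[header_size:], body_len // image_bytes
```

If the check fails, the file probably contains something besides the header and whole images (a trailer, padding, or a different image size), and the remainder trick would silently give a wrong header length.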

You can then disregard the header and convert the remaining bytes into integers:

pixels = [b for b in data[len_header:]]

Next, you can use numpy to rearrange this list into a 2-dimensional array with 720^2 columns, so that each row consists of pixels of a single image:

import numpy as np

images = np.array(pixels).reshape(-1, 720**2)
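For large files, building a Python list of integers first can be slow and memory-hungry. As an alternative sketch (a different route from the list-based conversion), np.frombuffer can view the bytes directly as unsigned 8-bit values. The function name load_images and its defaults are illustrative:

```python
import numpy as np


def load_images(data, header_size=640, side=720):
    """View the bytes after the header as a stack of side x side uint8 images."""
    pixels = np.frombuffer(data, dtype=np.uint8, offset=header_size)
    # -1 lets numpy infer the number of images; this raises ValueError if the
    # remaining byte count is not a multiple of side * side.
    return pixels.reshape(-1, side, side)
```

Calling stack = load_images(data) gives an array of shape (n_images, 720, 720), so stack[i] is already two-dimensional. The array is read-only because it shares memory with the bytes object; call .copy() if it needs to be modified.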

Each image can now be accessed as images[i], where i is the index of a row. This is a 1-dimensional array, so to make it into a 2-dimensional structure representing an image, reshape again:

images[i].reshape(720, 720)

Finally, you can use matplotlib to display the image and check whether it looks correct:

import matplotlib.pyplot as plt

plt.imshow(images[i].reshape(720, 720), cmap="gray_r")
plt.show()
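Once the images are in an array, the percent coverage described in the question (the share of pixels in a region that exceed a threshold) reduces to a boolean comparison and a mean. A minimal sketch, assuming the region is a rectangular slice of rows and columns and that raw byte values are compared directly; the function name and arguments are illustrative:

```python
import numpy as np


def percent_exceeding(image, threshold, rows=slice(None), cols=slice(None)):
    """Percent of pixels in image[rows, cols] whose value is greater than threshold."""
    region = np.asarray(image)[rows, cols]
    if region.size == 0:
        raise ValueError("selected region is empty")
    # The comparison yields booleans; their mean is the fraction that are True.
    return float((region > threshold).mean() * 100.0)
```

For example, percent_exceeding(images[i].reshape(720, 720), 100, slice(200, 400), slice(200, 400)) would report the coverage above a raw value of 100 inside a 200 by 200 pixel window. If the bytes encode a physical quantity through a scale defined in the header, the threshold would need converting to raw units first.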

You actually answered my question exactly. I had to modify my code a bit to include iterators/adjust for some different variable names, but I was able to omit the header and plot the data. Sorry I was vague in the initial question, but you provided help that would have taken me days to figure out. Thanks for the assistance! — tenkiforecast, Feb 13 '22 at 21:01
