Identify the contents of a file through a program in Python

Question

I have a file here that appears to be binary. It is a raw file, and I believe it contains stock information in OHLCV (Open, High, Low, Close, Volume) form. It may also contain some text.

One OHLCV entry that I expect to find in it is:

464.95, 468.3, 460, 465.65, 3957854

This is the code that I have tried. I don't fully understand ASCII and Unicode.

input_file = "00063181.dat"  # Tata Motors
# Read the whole file as raw bytes.
with open(input_file, "rb") as fh:
    buf = fh.read()
# Iterating over a bytes object yields ints in the range 0-255.
output_l = list(buf)
print(output_l)

My doubt: How do I decode this file and make sense of it? Is there a way to read this file with a program written in Python and separate the text from the int/float values? I am using Python 3 on 64-bit Windows 10.

Glad to hear you're using Python 3. – Jason R. Coombs Dec 24 '16 at 13:29

score 1 · Accepted Answer · answered Dec 24 '16 at 13:55

You're looking to reverse engineer the structure of a binary file using Python. Since you've declared that the file is binary, it may prove difficult. You're going to need to examine the contents of the file and use your best intuition to infer the structure. The first thing you're going to want is a way to display each of the bytes of the file in a way that will help you understand their meaning.

Fortunately, someone has already written a tool to do this, hexdump. Install that package using pip.

The function you need from that package is hexdump, so let's import the package and get help on the function.

>>> import hexdump
>>> help(hexdump.hexdump)
Help on function hexdump in module hexdump:

hexdump(data, result='print')
    Transform binary data to the hex dump text format:

    00000000: 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  ................

      [x] data argument as a binary string
      [x] data argument as a file like object

    Returns result depending on the `result` argument:
      'print'     - prints line by line
      'return'    - returns single string
      'generator' - returns generator that produces lines

Now you can start to explore the contents of your file. Use the slice operator to do it in chunks. For example, to render the contents of the first 1KB of your file:

>>> hexdump.hexdump(buf[:1024])
00000000: C3 8E C2 8F 22 13 C2 AA  66 2A 22 47 C3 94 C3 AA  ...."...f*"G....
00000010: C3 89 C3 A0 C3 B1 C3 91  6A C2 A4 C3 BF 3C C2 AA  ........j....<..
00000020: C2 91 73 C3 85 46 57 47  C2 88 C3 99 C2 B6 3E 2D  ..s..FWG......>-
00000030: C3 BA 69 10 C2 93 C3 94  38 C3 81 7A 6A 43 30 7C  ..i.....8..zjC0|
00000040: C3 BB C2 AA 01 2D C2 97  C3 83 C3 88 64 14 C3 9C  .....-......d...
00000050: C2 AB C2 AA C3 A2 74 C2  85 5D C3 97 4E 64 68 C3  ......t..]..Ndh.
...
000003C0: 42 C2 8F 06 7F 12 33 7F  79 1E 2C 2A 0F C3 92 36  B.....3.y.,*...6
000003D0: C3 A6 C2 96 C2 93 C2 8B  43 C2 9F 4C C2 95 48 24  ........C..L..H$
000003E0: C2 B3 C2 82 26 C3 88 C3  BD C3 96 12 1E 5E 18 2E  ....&........^..
000003F0: 37 C3 A7 C2 87 C3 AE 00  4F 3F C2 9C C3 A8 1C C2  7.......O?......

Hexdump has a nice property of rendering the byte position, the hex code, and then (if possible) the printable form of the character on the right.

Hopefully some of your text values will be visible there and that will give some clue as to how to reverse engineer your file.
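If scanning the dump by eye gets tedious, you can also pull out the printable runs programmatically. What follows is only a minimal sketch: the helper name printable_runs, the minimum run length of 4, and the sample bytes are all made up for illustration, and it only looks for plain ASCII, so text stored in another encoding would be missed.

```python
import re


def printable_runs(data: bytes, min_len: int = 4):
    """Yield (offset, text) for each run of printable ASCII of at least min_len bytes."""
    # 0x20-0x7E is the printable ASCII range (space through tilde).
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    for match in pattern.finditer(data):
        yield match.start(), match.group().decode("ascii")


# Made-up sample bytes, NOT taken from the real file.
sample = b"\x00\x01TATAMOTORS\x00\xff\xfe\x10NSE\x00EQ-2016\x02"
runs = list(printable_runs(sample))
for offset, text in runs:
    print(f"{offset:08X}: {text}")
```

With your own data you would pass buf instead of sample. The offsets tell you where in the hexdump to look for the fields surrounding each piece of text.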

Once you've started to determine how your file is structured, you can use the methods of the bytes type to manipulate your data. For example, if you find that your file is split into sections by the null byte (b'\x00'), you can get those sections thus:

>>> sections = buf.split(b'\x00')
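After a split like that, a quick way to get your bearings is to label each section as text or binary. The sketch below assumes a made-up buffer and a hypothetical helper named classify; it is not based on the real file. Keep in mind that null bytes also occur inside binary-encoded numbers, so splitting on them can cut a numeric field in two.

```python
def classify(section: bytes) -> str:
    """Roughly label a section as 'empty', 'text' (printable ASCII only) or 'binary'."""
    if not section:
        return "empty"
    if all(0x20 <= byte <= 0x7E for byte in section):
        return "text"
    return "binary"


# Made-up stand-in for the real file contents.
buf = b"TATAMOTORS\x00\x9a\x99\xe8\x43\x00\x00NSE"
sections = buf.split(b"\x00")
kinds = [classify(section) for section in sections]
for section, kind in zip(sections, kinds):
    print(kind, section)
```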

There are a lot of things that you're likely to have to learn as you dig deeper, like character encodings and number encodings (such as byte order, for example little-endian, for integers, and the binary formats used for floating-point numbers). You'll want to find some way to externally validate your results.
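To give a feel for number encodings, the standard library's struct module converts between Python values and packed bytes. The record layout below (four little-endian 32-bit floats followed by one unsigned 32-bit integer) is purely an assumption for illustration; your file may use a different layout entirely. The sketch packs the sample OHLCV entry from the question and unpacks it again.

```python
import struct

# ASSUMPTION: hypothetical record layout, not confirmed for the real file.
# "<" = little-endian without padding, "4f" = four 32-bit floats, "I" = unsigned 32-bit int.
RECORD = struct.Struct("<4fI")

# Pack the sample entry from the question to see what its bytes would look like.
raw = RECORD.pack(464.95, 468.3, 460.0, 465.65, 3957854)
print(RECORD.size, raw.hex(" "))

# Unpacking reverses the process; 32-bit floats only approximate most decimal values.
open_, high, low, close, volume = RECORD.unpack(raw)
print(round(open_, 2), round(high, 2), low, round(close, 2), volume)
```

One way to validate a guess externally is to pack a value you know should be in the file, such as a closing price from a published quote, and search buf for those bytes with buf.find(...). Trying a few candidate formats ("<f", "<d", ">f") is cheap.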

Best of luck.
