2

I want to read a file that contains a large amount of binary data and convert it to ASCII. The file is a sequence of records: the first 2 bytes give the size of a message, and the message itself follows. After reading the whole message I want to repeat the same steps: read 2 bytes for the size, then the message.

Code to print the input data:

import sys

with open("abc.dat", "rb") as f:
    byte = f.read(1)
    i = 0
    while byte:
        i += 1
        print byte + ' ',
        byte = f.read(1)
        if i == 80:  # stop after the first 80 bytes
            sys.exit()

Input data (80 bytes):

  O  T  C  _  A  _  R  C  V  R                                                            P  V  �  W          �  w              /  �              �  �  '            �  �  &  �  

Edit 1: output of the command hexdump -n200 otc_a_primary_1003_0600.dat:

0000000 4f03 4354 415f 525f 5643 0052 0000 0000
0000010 0000 0000 0000 0000 0000 0000 0000 0000
0000020 0000 0000 0000 0000 5650 57f2 0000 0000
0000030 77d1 0002 0000 0000 902f 0004 0000 0000
0000040 a2bd 1027 0000 0000 d695 e826 2e0b 3e11
0000050 aa55 0300 f332 0000 0046 0000 0000 0000
0000060 5650 57f2 0000 0000 22f8 0a6c 0000 0000
0000070 3030 3030 3730 3435 5135 0000 0000 0100
0000080 bdb4 0100 3000 5131 5a45 1420 077a 9c11
0000090 3591 1416 077a 9c11 dc8d 00c0 0000 0000
00000a0 0000 4300 5241 2020 7f0c 0700 ed0d 0700
00000b0 2052 2020 2030 aa55 0300 f332 0000 0046
00000c0 0000 0000 0000 5650                    
00000c8

I'm using Python's struct module with Python 2.7.6.

Program code:

import struct

msg_len = struct.unpack('h', f.read(2))[0]
msg_data = struct.unpack_from('s', f.read(msg_len))[0]
print msg_data

But I can't see the actual message; only a single character is printed on the console. How can I read the messages in such a binary file properly?

ketan
  • Can you put the first 50 bytes of your file in the question, please? It could be an endianness problem, but it's hard to guess without being able to test. Also, please specify whether you're using Python 2.x or 3.x. – Kruupös Oct 12 '16 at 09:00
  • Clearly [tag:python-2.x] as there is no parenthesis in `print` – tripleee Oct 12 '16 at 09:02
  • `print(repr(msg_data)[0:50])` might be helpful for diagnostics. – tripleee Oct 12 '16 at 09:03
  • `msg_data = f.read(msg_len)` gives what you want? – acw1668 Oct 12 '16 at 09:06
  • Please check: the first 80 bytes of the input data have been added. – ketan Oct 12 '16 at 09:16
  • Thanks, however, I was looking for the first bytes of your `abc.dat` file and the desired output, since the input data is clearly not what you expected. Can you try `hexdump -n80 abc.dat`, for instance? – Kruupös Oct 12 '16 at 09:30

3 Answers

2

It depends on how the two-byte length is stored in the data. For example, if the first two bytes of your file (as hex) were 00 01, does this mean the following message is 1 byte long or 256 bytes long? This is the difference between big-endian and little-endian format. Try both of the following; one should give more meaningful results. Each is designed to read the data in message-length chunks:

Big endian format

import struct

with open('test.bin', 'rb') as f_input:
    length =  f_input.read(2)

    while len(length) == 2:
        print f_input.read(struct.unpack(">H", length)[0])
        length =  f_input.read(2)

Little endian format

import struct

with open('test.bin', 'rb') as f_input:
    length =  f_input.read(2)

    while len(length) == 2:
        print f_input.read(struct.unpack("<H", length)[0])
        length =  f_input.read(2)

The actual data will need further processing. The H tells struct to treat the 2 bytes as an unsigned short (i.e. the value is never interpreted as negative).
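As a small illustration (not part of the original answer), the sketch below decodes the 00 01 example with both byte orders, and then shows how a signed 'h' differs from an unsigned 'H'. The byte values are made up for the demonstration and are not taken from the asker's file.

```python
import struct

two_bytes = b"\x00\x01"  # the hex 00 01 example; illustrative data only

big = struct.unpack(">H", two_bytes)[0]     # most significant byte first
little = struct.unpack("<H", two_bytes)[0]  # least significant byte first
print("big-endian: %d, little-endian: %d" % (big, little))

# 'h' is signed, so a large length can come out negative; 'H' never does.
high = b"\xff\xfe"  # illustrative data only
signed = struct.unpack(">h", high)[0]
unsigned = struct.unpack(">H", high)[0]
print("signed: %d, unsigned: %d" % (signed, unsigned))
```

If one byte order yields lengths that walk cleanly through the file and the other quickly runs past the end, that is a reasonable hint about which one the format uses.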

Something else to consider is that sometimes the length includes itself, so a length of 2 could mean an empty message.
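A minimal sketch of how that case could be handled, assuming a big-endian unsigned length. The function name read_message and its length_includes_itself flag are hypothetical, and the sample bytes are invented for the demonstration.

```python
import io
import struct

def read_message(stream, length_includes_itself=False, fmt=">H"):
    """Read one length-prefixed message; return None at end of file."""
    header_size = struct.calcsize(fmt)  # 2 bytes for 'H'
    header = stream.read(header_size)
    if len(header) < header_size:
        return None
    length = struct.unpack(fmt, header)[0]
    if length_includes_itself:
        if length < header_size:
            raise ValueError("length %d is smaller than its own header" % length)
        length -= header_size  # the payload is what remains after the header
    return stream.read(length)

# Invented data: a length of 5 that counts its own 2 bytes ("abc" follows),
# then a length of 2, which in this scheme means an empty message.
sample = io.BytesIO(b"\x00\x05abc" + b"\x00\x02")
first = read_message(sample, length_includes_itself=True)
second = read_message(sample, length_includes_itself=True)
third = read_message(sample, length_includes_itself=True)
print("%r %r %r" % (first, second, third))
```

Only the file's specification can say which convention applies; the flag just makes it easy to try both.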

Martin Evans
  • In the end, how can I view my original data in a human-readable format? The f_input.read() method only returns the data in binary form. Which format is useful for this? – ketan Oct 12 '16 at 09:52
  • Could you post a link to your file using something like [Jumpshare](https://jumpshare.com/)? It would make it much easier to figure out the format of the file. – Martin Evans Oct 12 '16 at 10:00
  • No, I can't do that. Please see my edit for reference. – ketan Oct 12 '16 at 10:10
  • I've had a look at the data. Binary files need a specification to work from which documents how each of the fields contained have been encoded. It would help to know where the file came from, or what application created it. With this information, a Python script could then extract all the details correctly. – Martin Evans Oct 12 '16 at 10:33
1

From the docs:

For the 's' format character, the count is interpreted as the size of the string, not a repeat count like for the other format characters; for example, '10s' means a single 10-byte string, while '10c' means 10 characters. If a count is not given, it defaults to 1. For packing, the string is truncated or padded with null bytes as appropriate to make it fit. For unpacking, the resulting string always has exactly the specified number of bytes. As a special case, '0s' means a single, empty string (while '0c' means 0 characters).

's' should be modified to str(msg_len)+'s'. It seems like a good idea to check that msg_len is sensible in advance.
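A short sketch of the difference, using an invented payload and a hard-coded msg_len in place of the value read from the file:

```python
import struct

data = b"abcde"  # invented payload
msg_len = 3      # would normally come from the 2-byte length field

# Sanity check before building the format string.
if not 0 <= msg_len <= len(data):
    raise ValueError("implausible message length: %d" % msg_len)

one = struct.unpack_from("s", data)[0]                # no count: a single byte
whole = struct.unpack_from("%ds" % msg_len, data)[0]  # '3s': three bytes as one string
print("%r %r" % (one, whole))
```

The first call is what the question's code does, which is why only one character is printed.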

IljaBek
  • I'm getting a **struct.error: bad char in struct format** error. Please show sample code. – ketan Oct 12 '16 at 09:21
  • Could you print `msg_len` to check it before using it in unpack_from? What does it say? – IljaBek Oct 12 '16 at 09:27
  • `msg_len=3; print struct.unpack_from(str(msg_len)+'s',"abcde")[0]` gives me 'abc', so the method works; the input needs checking. – IljaBek Oct 12 '16 at 09:30
  • @IljaBek: oh, the message length is different each time, e.g. 12832, 8274, 45, 768. – ketan Oct 12 '16 at 09:30
  • That error is generated if `msg_len` is negative (which isn't a possible length for a string). This could occur for a number of reasons such as wrong endianness on length field, negative values used as some sort of special marker, corrupt data, or the number should be unsigned. As a secondary comment, if you're only going to unpack that single string and it's the length you read in the first place, the result is identical to the string `read` returned in the first place. – Yann Vernier Oct 12 '16 at 10:14
1

Try:

import struct

with open('abc.dat', 'rb') as f:
    while True:
        try:
            msg_len = struct.unpack('H', f.read(2))[0] # unsigned, assume native byte order
            msg_data = f.read(msg_len) # just read 'msg_len' bytes
            print repr(msg_data)
        except struct.error:
            # fewer than 2 bytes left: reached EOF
            break
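
A loop like this does not distinguish a clean end of file from a record that is cut short. As a sketch of how the two cases could be told apart, the hypothetical generator below raises an error on truncated data and stops quietly at a clean end. It assumes a little-endian unsigned length, and the sample bytes are invented.

```python
import io
import struct

def iter_messages(stream, fmt="<H"):
    """Yield each length-prefixed message; raise ValueError on truncated data."""
    header_size = struct.calcsize(fmt)
    while True:
        header = stream.read(header_size)
        if not header:
            return  # clean end of file
        if len(header) < header_size:
            raise ValueError("file ends inside a length field")
        msg_len = struct.unpack(fmt, header)[0]
        msg_data = stream.read(msg_len)
        if len(msg_data) < msg_len:
            raise ValueError("expected %d bytes, got %d" % (msg_len, len(msg_data)))
        yield msg_data

# Two complete records: length 2 + "hi", length 3 + "abc".
good = io.BytesIO(b"\x02\x00hi" + b"\x03\x00abc")
messages = list(iter_messages(good))
print(messages)

# A record that claims 5 bytes but only has 2.
bad = io.BytesIO(b"\x05\x00hi")
try:
    list(iter_messages(bad))
    truncated = False
except ValueError as exc:
    truncated = True
    print("truncated: %s" % exc)
```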
acw1668