Python reads "20" instead of "00" from binary file

Question

I'm writing code to read a binary file and print the hex representation of its data as a CSV, using NULL bytes as the separator. When I look at the file in a binary/hex viewer, it shows this sequence as part of the file:

41 73 73 65 6d 62 6c 79 c8 2d 01 00 04 00 00 00 07 00 00 00 00

However, reading the file with this part of the code:

with open(file_in, "rb") as f:
    while (byte := f.read(1)):
        h_value = hex(ord(byte))
        h_value = ("0" + h_value[2:])[-2:]
        #print(byte)
        #print(h_value)
        if h_value != '00':
            data_read.append(h_value)
        else:
            data_read.append(h_value)
            if data_read:
                with open(file_out, 'a', newline = '') as c:
                    w = csv.writer(c)
                    w.writerow(data_read)
            data_read = []

Gives me this for that section instead:

41,73,73,65,6d,62,6c,79,c3,88,2d,01,20,04,20,20,20,07,20,20,20,20

This matters because there are actual "20" values elsewhere in the file as data. Uncommenting print(byte) and print(h_value) shows b' ' and 20 respectively, which makes me think that Python is reading the file wrong, rather than the output being converted afterwards. Is there anything I can do to preserve these NULL values through the process?

Edit 1: Additional info, this is running Python 3.8.2 using IDLE. No idea if the compiler would make a difference for this, but I'm going to see if Visual Studio gives me different results. The binary viewer is simply named Binary Viewer, version 6.17.

It looks like your data got mangled by several additional layers of processing at some point, including a nulls-to-spaces conversion and an attempt at UTF-8 encoding (note what happened to the c8 byte). We have no idea where those additional processing layers happened and no idea what you need to change to stop them from happening. — user2357112, Sep 11 '20 at 00:38
@user2357112supportsMonica I hadn't even noticed the c8 byte change before you pointed it out, but it looks like the solution I found addresses that as well. — Getor Appi, Sep 11 '20 at 01:08

paxdiablo · Answer 1 · 2020-09-11T01:20:59.553

There's nothing wrong with Python's reading of the file nor with the CSV creation, as evidenced by the following program:

import os, csv

os.system("od -xcb qq.in") # Show file as byte dump.

data_read = []
with open("qq.in", "rb") as f:
    byte = f.read(1)
    while (byte):
        h_value = hex(ord(byte))
        h_value = ("0" + h_value[2:])[-2:]
        data_read.append(h_value)
        print(ord(byte), h_value) # Check individual bytes.
        byte = f.read(1)

print(data_read)
with open("file_out.csv", 'w') as c:
    w = csv.writer(c)
    w.writerow(data_read)
os.system("cat file_out.csv") # Show final CSV output.

The output of that program is:

0000000    7341    6573    626d    796c    2dc8    0001    0004    0000
          A   s   s   e   m   b   l   y 310   - 001  \0 004  \0  \0  \0
        101 163 163 145 155 142 154 171 310 055 001 000 004 000 000 000
0000020    0007    0000    0000
         \a  \0  \0  \0  \0
        007 000 000 000 000
0000025
(65, '41')
(115, '73')
(115, '73')
(101, '65')
(109, '6d')
(98, '62')
(108, '6c')
(121, '79')
(200, 'c8')
(45, '2d')
(1, '01')
(0, '00')
(4, '04')
(0, '00')
(0, '00')
(0, '00')
(7, '07')
(0, '00')
(0, '00')
(0, '00')
(0, '00')
['41', '73', '73', '65', '6d', '62', '6c', '79', 'c8', '2d', '01', '00', '04', '00', '00', '00', '07', '00', '00', '00', '00']
41,73,73,65,6d,62,6c,79,c8,2d,01,00,04,00,00,00,07,00,00,00,00

Hence I would start by looking at your input file a little more closely, since it's likely to be the problem.

That's especially so because there appears to be another change from your input: the c8 byte has been changed into c3 88, which is what you get when the value is treated as a Unicode code point and encoded as UTF-8.

As you can see from this answer, 0xc8 is in the two-byte UTF-8 section:

Range              Encoding  Binary value
-----------------  --------  --------------------------
U+000080-U+0007ff  110yyyxx  00000yyy xxxxxxxx
                   10xxxxxx

The code point c8 is the bit sequence 000 1100 1000 (yyy = 000, xxxxxxxx = 1100 1000), so it is encoded in UTF-8 as 1100 0011 1000 1000, or c3 88.
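If you want to check that arithmetic yourself, here is a minimal Python sketch. The helper names utf8_two_byte and mangle are made up for illustration, and the "decode as Latin-1, swap NULs for spaces, re-encode as UTF-8" pipeline is only an assumption about what the extra processing might have done; whatever actually touched your file may work differently.

```python
# Bytes as shown by the hex viewer in the question.
original = bytes.fromhex("41737365" "6d626c79" "c82d0100" "04000000" "07000000" "00")


def utf8_two_byte(code_point):
    """Hand-encode a code point in U+0080..U+07FF using the table above."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("not in the two-byte UTF-8 range")
    first = 0b11000000 | (code_point >> 6)           # 110yyyxx
    second = 0b10000000 | (code_point & 0b00111111)  # 10xxxxxx
    return bytes([first, second])


def mangle(data):
    """Hypothetical reconstruction of the damage (an assumption, not a known cause)."""
    text = data.decode("latin-1")       # each byte becomes the code point of the same value
    text = text.replace("\x00", " ")    # NUL characters turned into spaces
    return text.encode("utf-8")         # code points >= 0x80 grow to two bytes


def as_csv_hex(data):
    """Format bytes the same way the question's CSV row does."""
    return ",".join(f"{b:02x}" for b in data)


print(as_csv_hex(utf8_two_byte(0xC8)))  # c3,88
print(as_csv_hex(mangle(original)))     # matches the row shown in the question
```

The second line printed is the same 22-value row the question reports, which is consistent with the file having been through a text-mode conversion before Python ever opened it.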

score 0 · Answer 2 · answered Sep 11 '20 at 01:07

With the information from the comments and paxdiablo's answer, I decided there must be something wrong with the file itself, since by all accounts the problem shouldn't be with Python. I opened it in the binary viewer again and exported it as a new .BIN file. The new file reads the way it's supposed to, so it looks like that's the solution.
