I am trying to fix an issue I'm having with null bytes in a CSV file.

The csv_file object is being passed in from a different function in my Flask application:

import codecs
import csv

stream = codecs.iterdecode(csv_file.stream, "utf-8-sig", errors="strict")
dict_reader = csv.DictReader(stream, skipinitialspace=True, restkey="INVALID")

for row in dict_reader:  # Error is thrown here
    ...

The error thrown in the console is _csv.Error: line contains NULL byte.

So far, I have tried:

  • different encoding types (I checked the encoding type and it is utf-8-sig)
  • using .replace('\x00', '')

but I can't seem to get these null bytes to be removed.

I would like to remove the null bytes (replacing them with empty strings), but I would also be okay with skipping any row that contains null bytes. I am unable to share my CSV file.
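
For the second option, skipping rows instead of cleaning them, one possible sketch is to drop any line that contains a null byte before the data is decoded. The helper name `rows_without_null_lines` and the sample bytes are made up for illustration, and the `io.BytesIO` object only stands in for `csv_file.stream`. It also assumes that no quoted field spans several lines, because it filters physical lines rather than parsed rows.

```python
import codecs
import csv
import io


def rows_without_null_lines(byte_stream):
    """Yield dict rows, skipping every physical line that contains a null byte."""
    # Filter on the raw bytes, before decoding and before csv sees the data
    clean_lines = (line for line in byte_stream if b"\x00" not in line)
    text_lines = codecs.iterdecode(clean_lines, "utf-8-sig", errors="strict")
    yield from csv.DictReader(text_lines, skipinitialspace=True, restkey="INVALID")


# Made-up sample data standing in for csv_file.stream
sample = io.BytesIO(b"\xef\xbb\xbfC1,C2\nr1c1,r1c2\nr2c1,\x00\x00\x00\n")
rows = list(rows_without_null_lines(sample))
print(rows)
```

With this sample, only the first data row should survive, since the second one contains null bytes and is dropped entirely.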

EDIT: The solution I reached:

    content = csv_file.read()

    # Converting the above object into an in-memory byte stream
    csv_stream = io.BytesIO(content)

    # Iterating through the lines and replacing null bytes with empty bytes
    fixed_lines = (line.replace(b'\x00', b'') for line in csv_stream)


    # Below remains unchanged, just passing in fixed_lines instead of csv_stream

    stream = codecs.iterdecode(fixed_lines, 'utf-8-sig', errors='strict')

    dict_reader = csv.DictReader(stream, skipinitialspace=True, restkey="INVALID")
chaztize7
  • What object did you try the `replace('\x00', '')` with, `stream`? Also, what kind of object is `csv_file`? – Zach Young Feb 18 '22 at 23:33
  • The csv_file is: . It is getting passed from a Flask endpoint into the function where the code in question is. I have tried using replace on csv_file, stream, and dict_reader, as well as on row.values in the loop. – chaztize7 Feb 21 '22 at 14:10
  • I ran `print(list(stream))` and found that the last row of data contains a field at the end that looks like '\x00\x00\x00\x00\x00', except much bigger. I understand how to find where the problem is, but I am not sure how to remove the values in this field given my current object structure. – chaztize7 Feb 21 '22 at 14:37
  • Please update/edit your question and include a sample of the csv_file object. – Zach Young Feb 21 '22 at 15:12

1 Answer

I think your question definitely needs to show a sample of the stream of bytes you expect from csv_file.stream.

I like pushing myself to learn more about Python's approach to I/O, encoding/decoding, and CSV, so I worked this much out for myself, but you probably shouldn't expect others to do the same.

import csv
from codecs import iterdecode
import io

# Flask's file.stream is probably BytesIO, see https://stackoverflow.com/a/18246385 
# and the Gist in the comment, https://gist.github.com/lost-theory/3772472?permalink_comment_id=1983064#gistcomment-1983064

csv_bytes = b'''\xef\xbb\xbf C1, C2
 r1c1, r1c2
 r2c1, r2c2, r2c3\x00'''

# This is what Flask is probably giving you
csv_stream = io.BytesIO(csv_bytes)

# fixed_lines is a lazy generator, `(line.repl...)` rather than a list `[line.repl...]`, so lines are cleaned one at a time
fixed_lines = (line.replace(b'\x00', b'') for line in csv_stream)

decoded_lines = iterdecode(fixed_lines, 'utf-8-sig', errors='strict')

reader = csv.DictReader(decoded_lines, skipinitialspace=True, restkey="INVALID")

for row in reader:
    print(row)

and I get:

{'C1': 'r1c1', 'C2': 'r1c2'}
{'C1': 'r2c1', 'C2': 'r2c2', 'INVALID': ['r2c3']}
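
The same cleanup can probably also be done after decoding, on `str` lines rather than `bytes`. This is only a sketch with made-up sample data, not something I ran against a real Flask upload. The point to notice is that the arguments to `replace` have to match the type of the line: `'\x00'` for `str`, `b'\x00'` for `bytes`.

```python
import codecs
import csv
import io

# Made-up sample: BOM, a header, and one row with embedded null bytes
csv_stream = io.BytesIO(b"\xef\xbb\xbfC1, C2\nr1c1, r1\x00c2\x00\x00\n")

decoded_lines = codecs.iterdecode(csv_stream, "utf-8-sig", errors="strict")

# After decoding, each line is a str, so replace takes str arguments
text_lines = (line.replace("\x00", "") for line in decoded_lines)

reader = csv.DictReader(text_lines, skipinitialspace=True, restkey="INVALID")
rows = list(reader)
print(rows)
```

Either way, the replacement has to happen on the individual lines before `csv.DictReader` consumes them, not on the reader or on the rows it yields.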
Zach Young
  • Modified your solution to fit the structure of my own data. This fixed the issue with null chars exiting out of the DictReader loop. Adding my own solution to my original question for accessibility. Thank you, Zach. – chaztize7 Feb 21 '22 at 16:27