
I'm trying to create a new version of a file that excludes NULL bytes. I'm using the code below, but it still breaks on the NULL byte. How should I structure the for statement and try/except block so that processing continues after the exception?

import csv

input_file = "/data/train.txt"
outFileName = "/data/train_no_null.txt"
############################

i_f = open( input_file, 'r' )
reader = csv.reader( i_f , delimiter = '|' )

outFile = open(outFileName, 'wb') 
mywriter = csv.writer(outFile, delimiter = '|')

i_f.seek( 0 )
i = 1

for line in reader:
    try:
        i += 1
        mywriter.writerow(line)

    except csv.Error:
        print('csv choked on line %s' % (i + 1))
        pass

EDIT:

Here's the error message:

Traceback (most recent call last):
  File "20150310_rewrite_csv_wo_NULL.py", line 26, in <module>
    for line in reader:
_csv.Error: line contains NULL byte

UPDATE:

I'm using this code:

i_f = open( input_file, 'r' )
reader = csv.reader( i_f , delimiter = '|' )
# reader.next()

outFile = open(outFileName, 'wb') 
mywriter = csv.writer(outFile, delimiter = '|')

i_f.seek( 0 )
i = 1


for idx, line in enumerate(reader):
    try:
        mywriter.writerow(line)
    except:
        print('csv choked on line %s' % idx)

and now get this error:

Traceback (most recent call last):
  File "20150310_rewrite_csv_wo_NULL.py", line 26, in <module>
    for idx, line in enumerate(reader):
_csv.Error: line contains NULL byte
screechOwl

2 Answers


You can catch all errors with the following code...

for idx, line in enumerate(reader):
    try:
        mywriter.writerow(line)
    except:
        print('csv choked on line %s' % idx)
Alex

The exception is raised by the reader, in the for statement itself, so it is not caught: that line is outside the try/except block.

Even if it were caught, the reader won't want to continue after its encounter with the NUL byte. But if the reader never saw it, along the lines of...

for idx, line in enumerate(csv.reader((line.replace('\0','') for line in open('myfile.csv')), delimiter='|')):

you might be OK.
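To make the one-liner above easier to follow, here is a minimal sketch of the same idea, assuming Python 3 and text-mode input. The helper names `strip_nul` and `copy_without_nul` are made up for illustration; the in-memory `io.StringIO` objects stand in for the real files.

```python
import csv
import io

def strip_nul(lines):
    """Yield each line with NUL characters removed (hypothetical helper)."""
    for line in lines:
        yield line.replace('\0', '')

def copy_without_nul(src, dst, delimiter='|'):
    """Copy delimited rows from src to dst and return the row count.

    The csv reader only ever sees lines that strip_nul has cleaned.
    """
    reader = csv.reader(strip_nul(src), delimiter=delimiter)
    writer = csv.writer(dst, delimiter=delimiter, lineterminator='\n')
    count = 0
    for row in reader:
        writer.writerow(row)
        count += 1
    return count

# In-memory stand-ins for the input and output files.
src = io.StringIO('a|b\0|c\nd|e|f\n')
dst = io.StringIO()
copied = copy_without_nul(src, dst)
```

With real files, you would presumably pass objects opened with `open(path, newline='')` and `open(path, 'w', newline='')` inside a `with` block so both are closed afterwards.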

Really though, you should find out where the NUL bytes are coming from as they might be symptomatic of a wider problem with your data.
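As a starting point for tracking them down, a rough sketch like the following reports which lines contain NUL bytes and how many. It assumes the input is read in binary mode; `find_nul_lines` is a hypothetical helper, and the `io.BytesIO` sample stands in for a file opened with `open(path, 'rb')`.

```python
import io

def find_nul_lines(binary_stream):
    """Return (line_number, nul_count) pairs for lines containing NUL bytes.

    binary_stream is any iterable of bytes lines, such as a file opened
    in 'rb' mode. Line numbers start at 1.
    """
    hits = []
    for lineno, raw in enumerate(binary_stream, start=1):
        count = raw.count(b'\0')
        if count:
            hits.append((lineno, count))
    return hits

# In-memory sample; only the second line contains NUL bytes.
sample = io.BytesIO(b'a|b|c\nd|\0e|\0f\ng|h|i\n')
report = find_nul_lines(sample)
```

If the affected lines cluster in particular records or columns, that may hint at where in the pipeline the bytes are introduced.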

langton
  • The data is coming out of a Redshift database and generally hanging out in S3. Any idea whether the NUL bytes are a function of that environment, or whether they were in the data before it went into Redshift? – screechOwl Mar 10 '15 at 22:33
  • I'm not sure how you've ended up with those NUL bytes in the data, then. Getting them into Redshift would be difficult: _If your data includes a null terminator, also referred to as NUL (UTF-8 0000) or binary zero (0x000), COPY treats it as an end of record (EOR) and terminates the record._ [http://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html]. And if you can't get them in there, I don't know how you got them back out! – langton Mar 10 '15 at 22:44