
I have a CSV file that has NUL bytes embedded in some of its data.

That is, given columns A, B, C, and D, one of the fields in column C would contain data like:

, quote character"Some Data" NUL "More Data" NUL "End of data" quote character,

When I open it with LibreOffice Calc, the NUL characters do not appear in the display, and if I save it by hand, they go away. I can see the NUL characters in vi and could remove or replace them with tr or by hand in vi, but I want to handle this automatically in the Python program.

The DictReader loop is

for row in infile:

and that statement itself throws the exception, so the except clause ends up outside the loop and cannot go back to get the next line (or let me change the NUL character to a space or an embedded comma and process that line).

Luckily, that data appears to be invalid in other ways too, so I would probably skip it in any event. Still, the question is: how do I tell Python to go on to the next line?

sabbahillel

1 Answer


So this is a bit ugly, but it seems to work. You can read each line as usual, strip the offending bytes, and then pass it to DictReader through a StringIO object. Here's the code, assuming your CSV has a header record (it would be simpler if it didn't):

#!/usr/bin/env python
# Python 2 only: uses the StringIO module, reader.next() and the print statement.

import StringIO
import csv

fin = open('somefilewithnulls', 'rb')
fout = StringIO.StringIO()
reader = csv.DictReader(fout)

while True:
    # for the first record, prime the StringIO with the first
    # two lines so DictReader can read the header
    line = fin.readline() if fin.tell() else fin.readline() + fin.readline()
    if not line:
        break

    # clean the line before passing it to DictReader
    line = line.replace('\x00', '')

    # append the cleaned text, then rewind to its start so the reader sees it
    fout.write(line)
    fout.seek(-len(line), 1)

    rec = reader.next()
    print rec

fin.close()
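
On Python 3 the same idea can probably be expressed more compactly, because csv.DictReader accepts any iterable of strings. The following is only a sketch under that assumption: strip_nuls and read_rows are hypothetical helper names, and the in-memory sample stands in for the real file.

```python
import csv
import io


def strip_nuls(lines):
    """Yield each line with NUL characters removed."""
    for line in lines:
        yield line.replace("\x00", "")


def read_rows(text_stream):
    """Parse CSV rows, dropping NULs before the csv module sees them."""
    return list(csv.DictReader(strip_nuls(text_stream)))


# Hypothetical sample data standing in for the real file.
sample = io.StringIO(
    'A,B,C,D\n1,2,"Some Data\x00More Data\x00End of data",4\n',
    newline="",
)
rows = read_rows(sample)
print(rows[0]["C"])
```

With a real file, the stream would be something like open(path, newline=""). Replacing the NUL with a space instead of an empty string only requires changing the second argument of replace.
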
fivetentaylor
Thank you. For now, I have a preprocessor in bash using tr to clean the files as they are being initially set up in the processing dictionary. I will keep this in mind for the future so that it could be part of the python processing. – sabbahillel Jan 13 '14 at 18:27
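
For reference, a tr-style cleaning pass like the one described in this comment could also be done in Python at the byte level. This is a rough sketch, not the commenter's actual preprocessor: remove_nul_bytes, the chunk size, and the temporary file names are all assumptions made for illustration.

```python
import os
import tempfile


def remove_nul_bytes(src_path, dst_path, chunk_size=65536):
    """Copy src_path to dst_path, deleting every NUL byte along the way."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            dst.write(chunk.replace(b"\x00", b""))


# Small demonstration on a throwaway file (hypothetical contents).
with tempfile.TemporaryDirectory() as tmp:
    raw = os.path.join(tmp, "raw.csv")
    clean = os.path.join(tmp, "clean.csv")
    with open(raw, "wb") as f:
        f.write(b'A,B\n1,"x\x00y"\n')
    remove_nul_bytes(raw, clean)
    with open(clean, "rb") as f:
        cleaned = f.read()
```

Working in binary mode avoids any encoding guesswork, and reading in chunks keeps memory use flat for large files.
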