105

I'm trying to write a program that looks at a .CSV file (input.csv) and rewrites only the rows that begin with a certain element (corrected.csv), as listed in a text file (output.txt).

This is what my program looks like right now:

import csv

lines = []
with open('output.txt','r') as f:
    for line in f.readlines():
        lines.append(line[:-1])

with open('corrected.csv','w') as correct:
    writer = csv.writer(correct, dialect = 'excel')
    with open('input.csv', 'r') as mycsv:
        reader = csv.reader(mycsv)
        for row in reader:
            if row[0] not in lines:
                writer.writerow(row)

Unfortunately, I keep getting this error, and I have no clue what it's about.

Traceback (most recent call last):
  File "C:\Python32\Sample Program\csvParser.py", line 12, in <module>
    for row in reader:
_csv.Error: line contains NULL byte

Credit to all the people here to even to get me to this point.

Community
  • 1
  • 1
James Roseman
  • 1,614
  • 4
  • 18
  • 24
  • Just a guess but it sounds like your input.csv file contains a blank line (mebe at the end?). Try lookin in the csvParser.py file for that exception text. – Sam Axe Oct 25 '11 at 19:43
  • I actually just went through the input.csv file and got rid of any and all blank space... still no luck (same error). – James Roseman Oct 25 '11 at 19:50
  • To pinpoint the line number, I suggest you introduce a counter variable and increment it within the ``for row in reader`` loop. – codeape Oct 25 '11 at 19:50
  • I'm not sure how I'm supposed to do that when the program itself won't execute. I tried adding a counter and nothing different showed up, just the same traceback error. – James Roseman Oct 25 '11 at 19:54
  • 5
    Do you have a NULL byte in your .csv? `open('input.csv').read().index('\0')` will give you the offset of the first one if you do. – retracile Oct 25 '11 at 19:55
  • @retracile, I'm sorry, do you mean as a standalone program (tried it -- it doesn't come up with anything) or in the code itself? – James Roseman Oct 25 '11 at 19:58

15 Answers15

87

I'm guessing you have a NUL byte in input.csv. You can test that with

if '\0' in open('input.csv').read():
    print "you have null bytes in your input file"
else:
    print "you don't"

if you do,

reader = csv.reader(x.replace('\0', '') for x in mycsv)

may get you around that. Or it may indicate you have utf16 or something 'interesting' in the .csv file.

retracile
  • 12,167
  • 4
  • 35
  • 42
  • 7
    +1 on finding NULL bytes in the file... unforetunately now my 'corrected.csv' file now reads in Japanese... – James Roseman Oct 25 '11 at 20:09
  • Sounds like your .csv isn't in ascii. I think further help is going to require a bit more information about your .csv's actual content. Have you tried opening it in a text editor like vim or notepad? Or running `file input.csv` to identify the file type? – retracile Oct 25 '11 at 20:14
  • I've opened it in Notepad and it looks fine. What should a csv look like? It reads the same as it does on Google Analytics, but with huge tabs between the data. – James Roseman Oct 25 '11 at 20:16
  • Damn... is there any way to replace the tabs with commas and have it work with the Python program? – James Roseman Oct 25 '11 at 20:21
  • just use csv.reader(...., `delimiter='\t'`), though that's not going to address the null bytes problem. – retracile Oct 25 '11 at 20:23
  • 1
    If your csv is tab delimited you need to specify so: `reader = csv.reader(mycsv, delimiter='\t')`. I imaging that the csv reader is gobbling up your whole file looking for the commas and getting all the way to EOF. But you definitely have an encoding issue. You need to specify the encoding when opening the file. – Steven Rumbalski Oct 25 '11 at 20:26
  • I'm still getting the following error `Traceback (most recent call last): File "C:\Python32\Sample Program\csvParser.py", line 12, in for row in reader: _csv.Error: line contains NULL byte` – James Roseman Oct 25 '11 at 20:28
  • I resaved it from GAnalytics as a true CSV and got the following issue: `Traceback (most recent call last): File "C:\Python32\Sample Program\csvParser.py", line 13, in if row[0] not in lines: IndexError: list index out of range` Does it make a difference if the CSV file has about 500+ entries? – James Roseman Oct 25 '11 at 20:39
  • Sounds like you have a line without any columns. Also http://code.google.com/p/ga-api-http-samples/source/browse/trunk/src/data_export/v2/python/pagination/feed_printer.py implies UTF-8 encoding. – retracile Oct 25 '11 at 20:42
  • +1 for the NULL bytes. This is the classic "Windows" version of utf8 that is *not quite* utf8. Download [notepad++](http://notepad-plus-plus.org/) and use it's encoding menu to convert to ascii. – Spencer Rathbun Oct 25 '11 at 20:59
  • @SpencerRathbun: WHAT is this not-quite-utf8 version, and why do you think it is causing any of the OP's problems, includilg NULL (sic) bytes??? – John Machin Oct 25 '11 at 22:38
  • @JohnMachin [windows-1252](http://en.wikipedia.org/wiki/Windows-1252) see also [windows code page](http://en.wikipedia.org/wiki/Windows_code_page) note how it was superseded by unicode. In windows xp, this is the default encoding to allow for "upper ascii" it does horrible things to python if you don't get the correct encoding, and it's non obvious, because it looks kind of like an early unicode in hex. When I first ran into it, I tried utf-8 and ascii, both produced NULL values from my file. The encodings don't match, so eventually a read gets a NULL all by itself, and can't make sense of it. – Spencer Rathbun Oct 26 '11 at 00:41
  • @SpencerRathbun: cp1252 is nothing to do with UTF-8, "early unicode in hex" is meaningless. ALL of (UTF-8, ASCII, and cp1252) have a NUL ("\x00") (it's not NULL!) byte. **If decoding your file as either ASCII or UTF-8 produced NULs, they were in the file already -- probably because the file was encoded in UTF-16LE** (what Windows calls "Unicode"), NOT cp1252. – John Machin Oct 26 '11 at 01:36
  • @JohnMachin I don't claim to be an expert at unicode, and I meant NUL. I wasn't aware that the difference in spelling was important. It's been awhile, so you're probably right. I just remember that it was windows being "special" and that python kept kicking out NullValueExceptions during encode/decode. It was the file type, and I thought it was cp1252. Thanks for the update, and I hope to leave cp1252, UTF-16LE and so on in the dustbin of history to rot forever. – Spencer Rathbun Oct 26 '11 at 01:45
  • @SpencerRathbun: NullValueException appears to be a Java and/or C# thing, nothing to do with Python. Your recollections don't appear to be very reliable. – John Machin Oct 26 '11 at 02:02
  • @JohnMachin six months will do that. I fixed it and we ran the files under the deadline. It was an unicode exception talking about NUL and I haven't seen it since. – Spencer Rathbun Oct 26 '11 at 02:12
81

I've solved a similar problem with an easier solution:

import codecs
csvReader = csv.reader(codecs.open('file.csv', 'rU', 'utf-16'))

The key was using the codecs module to open the file with the UTF-16 encoding, there are a lot more of encodings, check the documentation.

K. David C.
  • 958
  • 7
  • 3
  • 4
    I had this same problem with a CSV file created from LibreOffice, which had been originally opened from an Excel .xls file. For some reason, LibreOffice had saved the CSV file as UTF-16. You can tell by looking at the first 2 bytes of the file, if it's FF FE then it's a good indicator that it's UTF-16 – Tom Dalton Nov 01 '13 at 09:43
  • 4
    Note that if your file contains UTF-16 data that is *outside of the ASCII range* `csv.reader()` will not be able to handle it, and you'll get `UnicodeEncodeError`s instead. – Martijn Pieters Sep 04 '14 at 09:30
  • 12
    This just caused a different error to be raised, `UnicodeError: UTF-16 stream does not start with BOM` – Cerin Sep 05 '17 at 21:13
  • In my case it was `'utf-16le'`. – Paweł Szczur Jul 11 '18 at 12:51
  • in my case it is utf-8 and this is not helping – dermoritz Apr 08 '21 at 11:30
25

If you want to replace the nulls with something you can do this:

def fix_nulls(s):
    for line in s:
        yield line.replace('\0', ' ')

r = csv.reader(fix_nulls(open(...)))
Claudiu
  • 224,032
  • 165
  • 485
  • 680
  • 4
    Replacing null with a space won't be a good choice. Worked for me to replace with a empty string – Marcelo Assis Jun 07 '18 at 17:05
  • I have a question about how you have used yield. Given that this is in a loop, does it mean that it will still read the file line by line or would it load it into the memory at once? – Mansour.M Oct 27 '20 at 14:31
  • Thanks, Claudiu. This is an elegant, easy-to-adapt solution. However, what would be the difference if I replace the yield with return? Could you please explain me the difference for this particular case? – madhur Jan 11 '23 at 16:23
  • Or why to even return or yield it i.e. def fix_nulls(s): for line in s: line.replace('\0', ' ') – madhur Jan 11 '23 at 16:31
13

You could just inline a generator to filter out the null values if you want to pretend they don't exist. Of course this is assuming the null bytes are not really part of the encoding and really are some kind of erroneous artifact or bug.

See the (line.replace('\0','') for line in f) below, also you'll want to probably open that file up using mode rb.

import csv

lines = []
with open('output.txt','r') as f:
    for line in f.readlines():
        lines.append(line[:-1])

with open('corrected.csv','w') as correct:
    writer = csv.writer(correct, dialect = 'excel')
    with open('input.csv', 'rb') as mycsv:
        reader = csv.reader( (line.replace('\0','') for line in mycsv) )
        for row in reader:
            if row[0] not in lines:
                writer.writerow(row)
woot
  • 7,406
  • 2
  • 36
  • 55
  • 1
    Thanks! This worked for the NC election results files, which do indeed (!) use a null byte in place of a "0" byte in one column. See http://dl.ncsbe.gov/ENRS/resultsPCT20161108.zip – nealmcb Dec 07 '16 at 22:07
  • This solution worked for me. Replaced this NUL characters before to read each line in the processing section. Thanks! – eduardosufan May 20 '22 at 17:16
7

This will tell you what line is the problem.

import csv

lines = []
with open('output.txt','r') as f:
    for line in f.readlines():
        lines.append(line[:-1])

with open('corrected.csv','w') as correct:
    writer = csv.writer(correct, dialect = 'excel')
    with open('input.csv', 'r') as mycsv:
        reader = csv.reader(mycsv)
        try:
            for i, row in enumerate(reader):
                if row[0] not in lines:
                   writer.writerow(row)
        except csv.Error:
            print('csv choked on line %s' % (i+1))
            raise

Perhaps this from daniweb would be helpful:

I'm getting this error when reading from a csv file: "Runtime Error! line contains NULL byte". Any idea about the root cause of this error?

...

Ok, I got it and thought I'd post the solution. Simply yet caused me grief... Used file was saved in a .xls format instead of a .csv Didn't catch this because the file name itself had the .csv extension while the type was still .xls

Steven Rumbalski
  • 44,786
  • 9
  • 89
  • 119
  • 1
    `Traceback (most recent call last): File "C:\Python32\Sample Program\csvParser.py", line 17, in print ('csv choked on line %s' % (i+1)) NameError: name 'i' is not defined` – James Roseman Oct 25 '11 at 20:03
  • Ok. Then it's choking on the very first line. Run this and post what you see: `print(open('input.csv', 'r').readlines()[0])` – Steven Rumbalski Oct 25 '11 at 20:08
  • Something funky... but it's running. `ÿþ/` < That's all it would paste (it's mostly blocks and numbers) – James Roseman Oct 25 '11 at 20:10
  • 1
    Perhaps your csv isn't really a csv. See the second half of my answer. – Steven Rumbalski Oct 25 '11 at 20:12
  • Oh shoot that could completely be it, how might I go about fixing this? I saved it straight from Google Analytics too... – James Roseman Oct 25 '11 at 20:12
  • Then your file is probably fine. – Steven Rumbalski Oct 25 '11 at 20:27
  • This worked well for me. Thank you very much! However, it added unnecessary newlines int the corrected csv. I would recommend using ''wb' and 'rb' while opening the input and correct csv files, to treat them as binary files. – Jim C Apr 19 '17 at 05:10
4
    def fix_nulls(s):
        for line in s:
        yield line.replace('\0', '')

    with open(csv_file, 'r', encoding = "utf-8") as f:
        reader = csv.reader(fix_nulls(f))
        for line in reader:
            #do something

this way works for me

masterlk
  • 41
  • 1
3

A tricky way:

If you develop under Lunux, you can use all the power of sed:

from subprocess import check_call, CalledProcessError

PATH_TO_FILE = '/home/user/some/path/to/file.csv'

try:
    check_call("sed -i -e 's|\\x0||g' {}".format(PATH_TO_FILE), shell=True)
except CalledProcessError as err:
    print(err)    

The most efficient solution for huge files.

Checked for Python3, Kubuntu

SergO
  • 2,703
  • 1
  • 30
  • 23
2

for skipping the NULL byte rows

import csv

with open('sample.csv', newline='') as csv_file:
    reader = csv.reader(csv_file)
    while True:
        try:
            row = next(reader)
            print(row)
        except csv.Error:
            continue
        except StopIteration:
            break
shrhawk
  • 621
  • 7
  • 10
1

I've recently fixed this issue and in my instance it was a file that was compressed that I was trying to read. Check the file format first. Then check that the contents are what the extension refers to.

Daniel Lee
  • 7,189
  • 2
  • 26
  • 44
1

Turning my linux environment into a clean complete UTF-8 environment made the trick for me. Try the following in your command line:

export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8
export LANGUAGE=en_US.UTF-8
Philippe Oger
  • 1,021
  • 9
  • 21
  • for me also changing to UTF-8 solved the problem. On windows I used Notepad++ to change the format from UTF16 to UTF8. I then opened the file with libreoffice calc, and cleared extra lines etc. – Yuval Harpaz Jun 11 '18 at 11:31
1

pandas.read_csv now handles the different UTF encoding when reading/writing and therefore can deal directly with null bytes

data = pd.read_csv(file, encoding='utf-16')

see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

1

This is long settled, but I ran across this answer because I was experiencing an unexpected error while reading a CSV to process as training data in Keras and TensorFlow.

In my case, the issue was much simpler, and is worth being conscious of. The data being produced into the CSV wasn't consistent, resulting in some columns being completely missing, which seems to end up throwing this error as well.

The lesson: If you're seeing this error, verify that your data looks the way that you think it does!

David Hoelzer
  • 15,862
  • 4
  • 48
  • 67
0

This error seems to be fixed in latest python, using python 3.11.3 I don't get the error anymore. https://github.com/python/cpython/pull/28808

Mourad
  • 75
  • 1
  • 6
-1

The above information is great. For me I had this same error. My fix was easy and just user error aka myself. Simply save the file as a csv and not an excel file.

JQTs
  • 142
  • 2
  • 11
-2

It is very simple.

don't make a csv file by "create new excel" or save as ".csv" from window.

simply import csv module, write a dummy csv file, and then paste your data in that.

csv made by python csv module itself will no longer show you encoding or blank line error.

  • This answer does not provide any solution on how to manipulate input data, but rather how to "fix" input data. More often then not, the input data is not manageable. – despina Sep 01 '21 at 11:10