
(This question is related to this one)

I am parsing CSV content that has previously been loaded into memory:

# Imports assumed; the original post omits them (it may have used cStringIO instead)
import csv
import logging
from StringIO import StringIO

log = logging.getLogger(__name__)

def ReadTxtIntoColumns(txt, columns):
    rows = []
    # Debug output: dump the text and look for NUL bytes in it
    print txt
    print txt.find('\x00')
    print txt.count('\x00')
    f = StringIO(txt)
    try:
        reader = csv.reader(f, delimiter=' ', skipinitialspace=True)
        for row in reader:
            # Merge all extra columns
            if len(row) >= columns:
                rest = ' '.join(row[columns-1:])
                del row[columns-1:]
                row.append(rest)
            # And now set missing columns to None
            for i in range(len(row), columns):
                row.append(None)
            rows.append(row)
    except csv.Error as e:
        log.error('ReadTxtIntoColumns > Problems reading csv from string > line %d: %s', reader.line_num, e)
    finally:
        f.close()
    return rows

I wrote the input data by hand (a simple, space-separated text file). There is no '\x00' anywhere in my input data, but csv complains about one:

ReadTxtIntoColumns > Problems reading csv from string > line 1: line contains NULL byte

What does the error message mean then?

EDIT

This is my simplified input data, which I have verified is still causing the same problem:

#COMMAND                      USER        DIRECTORY                        SAFE   COMMAND
uname                         -            -                               FALSE  uname -a
sleep                         -            -                               FALSE  sleep 100
blueFast
  • You don't need to wrap the input in a `StringIO` object; just pass in any iterable; `text.splitlines(True)` will do fine, for example. – Martijn Pieters Jun 05 '13 at 19:30
  • Can you give us a minimal sample `text` value that reproduces the problem? – Martijn Pieters Jun 05 '13 at 19:31
  • wouldn't `numpy.loadtxt` work for your case? – Saullo G. P. Castro Jun 05 '13 at 19:32
  • @Martijn Pieters: Added input data. The funny thing, I am having this problem in one system, but not in another system. csv library is the same in both. One is python 2.7.3 (ok), the other 2.7.2 (error) – blueFast Jun 05 '13 at 20:18
  • @gonvaled: Is there any Unicode involved? – Martijn Pieters Jun 05 '13 at 20:20
  • @Martijn Pieters: there shouldn't be. The input file is just simple text without special characters written in emacs, so I assume is just simple ascii encoding. – blueFast Jun 05 '13 at 20:24
  • @gonvaled: I am talking about the type in Python, `type(txt)` should be `str`, not `unicode`, as `StringIO` won't handle that correctly. – Martijn Pieters Jun 05 '13 at 20:26
  • @Martijn Pieters: let me check again. I am processing the input file with `pystache`, and I have extended my pystache wrapper lately to handle unicode. If I verify the type of `txt` it is indeed unicode, so that could be the problem. The question still remains why that does not cause problem with python 2.7.3. And more important for me: how to parse the unicode csv with the csv module? – blueFast Jun 05 '13 at 20:26
  • @gonvaled: Neither the `csv` module nor the `StringIO` object support `unicode` data. See the warning on top of http://docs.python.org/2/library/csv.html. – Martijn Pieters Jun 05 '13 at 20:28
  • Oh, now I see your comment. So `StringIO` is the problem? I'll check that. How do I pass an iterable to the csv reader? Why is python 2.7.3's StringIO happy about it? – blueFast Jun 05 '13 at 20:28
  • gonvaled: I think @Martijn was suggesting that for data this simple you don't really need to use the `csv` module at all. In that spirit I'd suggest using `for row in txt.splitlines():` instead of `for row in reader:`. – martineau Jun 05 '13 at 20:50
  • @martineau: No; I wasn't necessarily suggesting that; you can pass the result of `txt.splitlines(True)` to the `csv.reader()` object and it'll work, provided `txt` is a byte string, not `unicode`. – Martijn Pieters Jun 05 '13 at 21:10
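
To make the suggestion from the comments concrete, here is a minimal sketch that feeds `txt.splitlines(True)` straight to `csv.reader()` instead of wrapping the text in `StringIO`. The name `read_txt_into_columns` and the sample text are only for illustration and are not from the original code; on Python 2, `txt` is assumed to be a byte string (`str`), as pointed out above.

```python
import csv


def read_txt_into_columns(txt, columns):
    """Parse space-separated text into rows of exactly `columns` fields."""
    rows = []
    # csv.reader accepts any iterable of lines, so no file-like wrapper is needed.
    reader = csv.reader(txt.splitlines(True), delimiter=' ', skipinitialspace=True)
    for row in reader:
        if len(row) >= columns:
            # Fold every extra field into the last column.
            row[columns - 1:] = [' '.join(row[columns - 1:])]
        # Pad short rows with None.
        row.extend([None] * (columns - len(row)))
        rows.append(row)
    return rows


# Hypothetical sample, shaped like the input in the question.
sample = (
    "#COMMAND   USER   DIRECTORY   SAFE   COMMAND\n"
    "uname      -      -           FALSE  uname -a\n"
    "sleep      -      -           FALSE  sleep 100\n"
)
rows = read_txt_into_columns(sample, 5)
```

With this sample, the trailing `uname -a` ends up merged into the fifth column, and rows with fewer fields are padded with `None`.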

2 Answers


The csv module documentation contains the following warning:

This version of the csv module doesn’t support Unicode input. Also, there are currently some issues regarding ASCII NUL characters. Accordingly, all input should be UTF-8 or printable ASCII to be safe; see the examples in section Examples.

The StringIO.StringIO object supports unicode, but if you are using the cStringIO module, then cStringIO.StringIO doesn't, and can lead to more problems.

If your data is ASCII only, simply encode txt first (called without arguments, encode() uses the default encoding, which is ASCII on Python 2):

txt = txt.encode()

There could have been some fixes added to 2.7.3 that make the problem less visible.
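
For text that is not guaranteed to be ASCII, here is a rough sketch along the lines of the recipe in the Examples section of the Python 2 csv docs: encode each line to UTF-8 before it reaches the reader, and decode each cell afterwards. The helper name `csv_rows` is made up for this example, and the second branch is only there so the snippet also runs on Python 3, where csv reads text directly.

```python
import csv
import sys

PY2 = sys.version_info[0] == 2


def csv_rows(txt, encoding='utf-8', **fmtparams):
    """Yield rows of cells parsed from `txt`, which may be a unicode string."""
    lines = txt.splitlines(True)
    if PY2 and not isinstance(txt, bytes):
        # Python 2 with unicode input: csv wants byte strings, so encode
        # each line on the way in and decode each cell on the way out.
        encoded = (line.encode(encoding) for line in lines)
        for row in csv.reader(encoded, **fmtparams):
            yield [cell.decode(encoding) for cell in row]
    else:
        # Python 3 text, or Python 2 byte strings: csv reads the lines as-is.
        for row in csv.reader(lines, **fmtparams):
            yield row


parsed = list(csv_rows(u"uname - - FALSE uname -a\n",
                       delimiter=' ', skipinitialspace=True))
```

The merging and padding of columns from the question could then be applied to each row exactly as before.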

Martijn Pieters

The csv module has problems reading data from Unicode-encoded files. Your code worked when I pasted it into the Python interpreter and called it with a manually entered text string, so it should work if you save the file in ANSI/ASCII format, or convert it to ASCII when loading it into memory.
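
As an illustration only (not a diagnosis of this particular case), here is one way NUL bytes can show up in text that looks like plain ASCII: in UTF-16, every ASCII character carries a zero byte, which is what the csv reader then trips over. The sample string is hypothetical.

```python
text = u"uname - - FALSE uname -a\n"

as_ascii = text.encode('ascii')
as_utf16 = text.encode('utf-16-le')

# ASCII bytes contain no NUL; UTF-16 adds one zero byte per ASCII character.
nul = b'\x00'
ascii_nuls = as_ascii.count(nul)
utf16_nuls = as_utf16.count(nul)

# Converting back: decode with the real encoding, then re-encode as ASCII.
converted = as_utf16.decode('utf-16-le').encode('ascii')
```

If the data really is stored in such an encoding, decoding it with the right codec and re-encoding it as ASCII (or UTF-8) before handing it to csv removes the zero bytes.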

CCKx