Python and parsing unicode files

Question

A few weeks ago I wrote a CSV parser in Python, and it was working great with the provided text file. But when we tried to test it with other files, the problems started.

First was the

ValueError: empty string for float()

for a string like "313.44". The problem was that the file was Unicode, so there were empty bytes ('\x00') between the digits.

OK, so I decided to read it as Unicode with

codecs.open(filename, 'r', 'utf-16')

And then all hell broke loose: a missing BOM, problems with the line-ending characters (LF vs. CR+LF), etc.

So can you give me a hint or a workaround for parsing Unicode and non-Unicode files when I do not know what the encoding is, whether a BOM is present, what the line endings are, etc.?

P.S. I am using Python 2.7

Why are you writing a csv parser and not just using the `csv` module? — Daenyth, Mar 29 '11 at 14:44

score 1 · Accepted Answer · answered Feb 06 '12 at 20:23

The problem was solved using the csv module as proposed by Daenyth
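The answer does not show the resulting code, so the following is only a rough sketch of how the csv module could be combined with a simple encoding guess. It is written for Python 3; on Python 2.7 the csv module works on byte strings, so the decoded text would have to be re-encoded (for example to UTF-8) before being handed to `csv.reader`. The names `guess_encoding` and `parse_csv_bytes` are made up for illustration, and the BOM-less detection is a heuristic that assumes mostly ASCII content such as numbers.

```python
import codecs
import csv
import io

# BOMs are checked in this order so that the UTF-32 LE BOM (FF FE 00 00)
# is not mistaken for the shorter UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding(raw):
    """Guess a codec name for raw bytes (heuristic, not a guarantee)."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            # These codecs consume the BOM while decoding.
            return name
    # No BOM: NUL bytes between characters suggest UTF-16 without a BOM.
    sample = raw[:1024]
    if b"\x00" in sample:
        even_nuls = sample[0::2].count(b"\x00")
        odd_nuls = sample[1::2].count(b"\x00")
        # ASCII text in little-endian UTF-16 has its NULs at odd offsets.
        return "utf-16-le" if odd_nuls > even_nuls else "utf-16-be"
    # Assumption: anything else is treated as UTF-8.
    return "utf-8"


def parse_csv_bytes(raw):
    """Decode raw bytes and return the CSV rows as lists of strings."""
    text = raw.decode(guess_encoding(raw))
    # newline="" leaves LF and CR+LF untouched so csv can handle both.
    return list(csv.reader(io.StringIO(text, newline="")))
```

To use it, read the file in binary mode (`open(filename, "rb").read()`) and pass the bytes to `parse_csv_bytes`; the values in each row can then be converted with `float()` as before.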

answered Feb 06 '12 at 20:23

Ilian Iliev


score 0 · Answer 2 · edited May 23 '17 at 10:33

It mainly depends on the Python version you are using, but these two links should help you out:

edited May 23 '17 at 10:33

Community


answered Mar 29 '11 at 13:37

Moss

