3

I am reading a txt file into Python, extract parts of it and then output the results as CSV.

Problem is, I get encoding hiccups I don't know how to resolve. Here is what happens:

  1. The input file is an export from Adobe Acrobat, where I output a "plain text file" after changing the setting to "UTF-8".

  2. I then read this into Python like so

    inputfile=codecs.open(inputfile, "r", "utf-8")
    
  3. I run a regex to extract parts of it, make them into a pandas DataFrame (called 'dataframe' here).

  4. Then it write the dataframe out as a csv file, but no matter how I do it I run into problems. I tried

    outputfile=codecs.open(outputfile, "w", "utf-8")
    dataframe.to_csv(outputfile, encoding="utf-8")
    

but that gives me an

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 23:  ordinal not in range(128)

Question:

This is the first thing I don't get: why is there an 'ascii'-codec involved if I am setting the to_csv encoding="utf-8"? Acc to the docs, this is a

A string representing the encoding to use in the output file, defaults to ‘ascii’ on Python 2...

I can avoid this error by not specifying "utf-8" in codecs.open(). But then, once I import the file into Excel (setting import to "Unicode-Utf-8"), all ' characters show up as __. There are no other encoding errors as far as I can see and if I open the csv file in TextWrangler, everything works out just fine.

  • Where is the problem here -- is it the output from pandas, is it Excel, or what else?

  • Is there a better way of tackling this?

I am working in Python 2 on a Mac. I did not use the Python csv module because it doesn't do UTF without a workaround.

Any help is appreciated!

EDIT: Here is what the inputfile looks like in WordWrangler:

23 It’s lying down there on the floor.

Here is Excel:

It__s lying down there on the floor.

After Fawful's helpful comment, I also tried to open the original text file in Excel. Seems like it already encodes the ' as __ in that one.

patrick
  • 4,455
  • 6
  • 44
  • 61

3 Answers3

1

Not a clean solution, but for a quick fix just use .replace('\0xe2', ' ').

  • Thanks, that would indeed work! I'd rather avoid that though as I have a couple more files to process and I'm afraid the hacks might accumulate... – patrick Jun 27 '16 at 17:54
1

When writing the file from your pandas dataframe, do not use a codecs file object. pandas.to_csv() already encodes your data, and the codecs file object then has to try to decode (as ASCII) in order to be able to re-encode it.

Just use a regular file:

with open(outputfile, "w") as outputfile:
    dataframe.to_csv(outputfile, encoding="utf-8")

You can use the csv module as well, but then you have to encode all your row data to UTF-8 before passing the row to the csv.writer().writerow() function. The csv module examples section includes code that automates this for you.

Martijn Pieters
  • 1,048,767
  • 296
  • 4,058
  • 3,343
  • Thanks! I have a quick question re `(outputfile, "w", "utf-8")` -- this does not work for me, it asks for an "integer" instead. Can [open](https://docs.python.org/2/library/functions.html#open) really take the "utf-8" setting? I couldn't find it in the docs either... If I get rid of the utf-8 part, things work but Excel still messes up in the same way. I'd think it's an issue with Acrobat's export, but then everything works just fine in WordWrangler... – patrick Jun 27 '16 at 18:44
  • @patrick: sorry, that was my mistake, a copy and paste error. – Martijn Pieters Jun 27 '16 at 18:45
  • ah, okay! It's good to know about the pandas-encoding thing though. I'll give the csv module a try but beginning to think it's not really something that I am doing but Excel or Acrobat. – patrick Jun 27 '16 at 18:49
0

So looks like this is an Excel (at least Mac Excel 2011) issue; appears that while it does offer an "Unicode UTF-8" setting, it actually can't deal with unicode characters.

See e.g. this SO post and this post from Microsoft. My files open just fine in Numbers. Didn't even know I had that on this machine...

patrick
  • 4,455
  • 6
  • 44
  • 61