I am reading a txt file into Python, extract parts of it and then output the results as CSV.
Problem is, I get encoding hiccups I don't know how to resolve. Here is what happens:
The input file is an export from Adobe Acrobat, where I output a "plain text file" after changing the setting to "UTF-8".
I then read this into Python like so
inputfile=codecs.open(inputfile, "r", "utf-8")
I run a regex to extract parts of it, make them into a pandas
DataFrame
(called 'dataframe' here).Then it write the dataframe out as a csv file, but no matter how I do it I run into problems. I tried
outputfile=codecs.open(outputfile, "w", "utf-8") dataframe.to_csv(outputfile, encoding="utf-8")
but that gives me an
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 23: ordinal not in range(128)
Question:
This is the first thing I don't get: why is there an 'ascii'-codec involved if I am setting the to_csv encoding="utf-8"
? Acc to the docs, this is a
A string representing the encoding to use in the output file, defaults to ‘ascii’ on Python 2...
I can avoid this error by not specifying "utf-8" in codecs.open()
. But then, once I import the file into Excel (setting import to "Unicode-Utf-8"), all ' characters show up as __. There are no other encoding errors as far as I can see and if I open the csv file in TextWrangler, everything works out just fine.
Where is the problem here -- is it the output from pandas, is it Excel, or what else?
Is there a better way of tackling this?
I am working in Python 2 on a Mac. I did not use the Python csv module because it doesn't do UTF without a workaround.
Any help is appreciated!
EDIT: Here is what the inputfile looks like in WordWrangler:
23 It’s lying down there on the floor.
Here is Excel:
It__s lying down there on the floor.
After Fawful's helpful comment, I also tried to open the original text file in Excel. Seems like it already encodes the ' as __ in that one.