Python line.replace returns UnicodeEncodeError

Question

I have a tex file that was generated from rst source using Sphinx. It is encoded as UTF-8 without BOM (according to Notepad++) and named final_report.tex, with the following content:

% Generated by Sphinx.
\documentclass[letterpaper,11pt,english]{sphinxmanual}
\usepackage[utf8]{inputenc}
\begin{document}

\chapter{Preface}
Krimson4 is a nice programming language.
Some umlauts äöüßÅö.
That is an “double quotation mark” problem.
Johnny’s apostrophe allows connecting multiple ports.
Components that include data that describe how they ellipsis …
Software interoperability – some dash – is not ok.
\end{document}

Now, before I compile the tex source to pdf, I want to replace some lines in the tex file to get nicer results. My script was inspired by another SO question.

#!/usr/bin/python
# -*- coding: utf-8 -*-
import os

newFil=os.path.join("build", "latex", "final_report.tex-new")
oldFil=os.path.join("build", "latex", "final_report.tex")

def freplace(old, new):
    with open(newFil, "wt", encoding="utf-8") as fout:
        with open(oldFil, "rt", encoding="utf-8") as fin:
            for line in fin:
                print(line)
                fout.write(line.replace(old, new))
    os.remove(oldFil)
    os.rename(newFil, oldFil)

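# Raw strings keep the backslash in \documentclass literal instead of relying on an unrecognized escape.
freplace(r'\documentclass[letterpaper,11pt,english]{sphinxmanual}', r'\documentclass[letterpaper, 11pt, english]{book}')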

This works on Ubuntu 16.04 with Python 2.7 as well as Python 3.5, but it fails on Windows with Python 3.4. The error message I get is:

File "C:\Python34\lib\encodings\cp850.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u201c' in position 11: character maps to <undefined>

where 201c stands for left double quotation mark. If I remove the problematic character, the script proceeds till it finds the next problematic character.

In the end, I need a solution that works on Linux and Windows with Python 2.7 and 3.x. I tried quite a lot of the solutions suggested here on SO, but could not yet find one that works for me...

My example does not have 19 lines, so I assume the error message refers to line 19 of the `cp850.py` file. — matth, Jul 04 '16 at 14:29
related: http://stackoverflow.com/questions/10971033/backporting-python-3-openencoding-utf-8-to-python-2 — matth, Jul 06 '16 at 14:32

score 2 · Accepted Answer · answered Jul 04 '16 at 14:17


You need to specify the correct encoding by passing encoding="the_encoding" to open:

with open(oldFil, "rt", encoding="utf-8") as fin, open(newFil, "wt", encoding="utf-8") as fout:

If you don't, the platform's preferred encoding will be used.

From the documentation for open:

In text mode, if encoding is not specified the encoding used is platform dependent: locale.getpreferredencoding(False) is called to get the current locale encoding.
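
To see what that fallback is on a given machine, a small Python 3 check along these lines may help. It is only a sketch: `can_encode` is a helper made up for this illustration, and cp850 is used because it is the code page named in the traceback in the question.

```python
import locale

# The encoding open() falls back to in text mode when encoding= is omitted.
default_encoding = locale.getpreferredencoding(False)
print("default text encoding:", default_encoding)

quote = "\u201c"  # left double quotation mark, the character from the traceback


def can_encode(text, encoding):
    """Hypothetical helper: report whether `text` is representable in `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


print("cp850:", can_encode(quote, "cp850"))  # cp850 has no slot for U+201C
print("utf-8:", can_encode(quote, "utf-8"))  # UTF-8 can encode any code point
```

If the reported default is a legacy code page, any text stream opened without an explicit encoding is limited to the characters of that code page.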


Padraic Cunningham


@matth, what double quote? If you still have encoding issues then you don't have utf-8 encoded data – Padraic Cunningham Jul 04 '16 at 14:29
@matth, you have specified the encoding as utf-8 for both and the error happens on write? – Padraic Cunningham Jul 04 '16 at 15:15
@matth, so setting the encoding fixed the first error but now you have another? – Padraic Cunningham Jul 04 '16 at 19:05
Yes it fixed the first error, but not my real world use case, I am still getting the UnicodeEncodeError. So I updated the question to be closer to my real use case. – matth Jul 05 '16 at 06:10
All other errors disappear when I delete the `print()` statement used for debugging. My `sys.stdout.encoding` is `cp850` which is not able to display the unicode character. – matth Jul 06 '16 at 13:53
All other errors disappear when I delete the print() statement used for debugging. My sys.stdout.encoding is cp850 which is not able to display the unicode character. Any idea how to print to stdout (I do not need it now, but it would be nice to know for future debugging). Also, Python 2 is now complaining about the encoding. – matth Jul 06 '16 at 14:10

For Python 2 you would need to use the io lib, using io.open. When I get back on my comp this evening I will add a link to a nice answer that allows you to print from a cmd shell, although I would recommend using cygwin as your default shell on Windows or using an IDE like PyCharm. – Padraic Cunningham Jul 06 '16 at 14:26
`io.open` did the trick, thanks. Should I test for the python version before importing or is it acceptable to import no matter whether py2 or 3? – matth Jul 06 '16 at 14:32

No need, Python3's open is io.open so the code will work as is for both 2.7 and 3 – Padraic Cunningham Jul 06 '16 at 14:35
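
Putting the two points from the comments together (io.open for Python 2.7 and 3.x, and a console that cannot display every character), a rough sketch could look like the following. It is not the asker's final script: `freplace` takes the path as a parameter here, `console_safe` is a helper invented for this example, and the demo writes to a throwaway temporary file rather than build/latex.

```python
# -*- coding: utf-8 -*-
import io
import os
import sys
import tempfile


def freplace(path, old, new):
    """Rewrite the file at `path`, replacing `old` with `new` on every line."""
    tmp_path = path + "-new"
    # io.open accepts encoding= on Python 2.7; on Python 3 it is the built-in open.
    with io.open(path, "rt", encoding="utf-8") as fin, \
            io.open(tmp_path, "wt", encoding="utf-8") as fout:
        for line in fin:
            fout.write(line.replace(old, new))
    os.remove(path)
    os.rename(tmp_path, path)


def console_safe(text, encoding=None):
    """Hypothetical helper: escape characters the console encoding cannot show."""
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "ascii"
    # Unencodable characters become \uXXXX escapes instead of raising an error.
    return text.encode(encoding, "backslashreplace").decode(encoding)


# Demonstration on a temporary file (assumed content, not the real report).
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "final_report.tex")
with io.open(demo_path, "wt", encoding="utf-8") as f:
    f.write(u"\\documentclass{sphinxmanual}\nThat is a \u201cquote\u201d.\n")

freplace(demo_path, u"{sphinxmanual}", u"{book}")

with io.open(demo_path, "rt", encoding="utf-8") as f:
    result = f.read()

# Debug output that no longer fails on a cp850 (or other limited) console.
for line in result.splitlines():
    print(console_safe(line))
```

The file itself is always read and written as UTF-8. Only the debug output is degraded to escapes, and only when the console encoding cannot represent a character.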
