Python2.7 prints wrong characters from unicode with hexadecimal chars

Question

sys.getdefaultencoding()
-> utf8
test = u'tempête'
test
-> u'temp\xc3\xaate'
print(test)
-> tempÃªte # WTF ?

sys.setdefaultencoding('ascii')
sys.getdefaultencoding()
-> ascii
test = u'tempête'
test
-> u'temp\xc3\xaate'
print(test)
-> tempÃªte #...

I observe these results when I do a set_trace() from pdb.

In a python2.7 shell I get the correct result:

sys.getdefaultencoding()
-> ascii
test = u'tempête'
test
-> u'temp\xc3\xaate'
print(test)
-> tempÃªte # WTF ?

I've been struggling with this for hours...

In python2.7 shell, I get `AttributeError: 'NoneType' object has no attribute 'CodecInfo'` — zondo, Feb 24 '16 at 16:16
Python 2.x should never have a default encoding of UTF-8. You would've had to `reload(sys)` to make this work, which should tell you that it's not supposed to be played with. — Alastair McCormack, Feb 25 '16 at 08:59
I'm voting to close this question as off-topic because the OP has created a more accurate representation of the problem here: http://stackoverflow.com/questions/35648216/python-scrapy-bad-utf8-characters-writed-in-file-from-scraped-html-page-with — Alastair McCormack, Feb 26 '16 at 09:55

score 1 · Accepted Answer · edited May 23 '17 at 11:50

Ensure your locale encoding matches the encoding your terminal emulator is set to. Type `locale` to check.

sys.setdefaultencoding() has nothing to do with printing - Python uses your locale to set the stdout encoding used when printing. See sys.stdout.encoding.
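To see which encodings are actually in play, something like the following may help. It is only a minimal sketch: `describe_output_encoding` is a hypothetical helper name, not part of any library, and the values it reports depend entirely on your environment (`sys.stdout.encoding` can be `None` when output is piped).

```python
import locale
import sys


def describe_output_encoding():
    """Collect the encodings relevant to printing (hypothetical helper)."""
    return {
        # Encoding used when printing unicode to the terminal
        "stdout": getattr(sys.stdout, "encoding", None),
        # Encoding derived from the locale settings (LANG, LC_ALL, ...)
        "locale": locale.getpreferredencoding(),
        # Default used for implicit str/unicode conversion, not for printing
        "default": sys.getdefaultencoding(),
    }


info = describe_output_encoding()
for name in sorted(info):
    print("%s: %s" % (name, info[name]))
```

If `stdout` and `locale` disagree with what the terminal is configured to display, mismatched output like the one in the question is a likely result.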

I can partially replicate your problem like this:

Set terminal emulation to: UTF-8
Set locale to en_GB.ISO8859-1, i.e. not UTF-8:
```
export LANG=en_GB.ISO8859-1
```

Run your code:

>>> test = u'tempête'
>>> test
u'temp\xc3\xaate'

The fact that ê becomes Ã (U+00C3) and ª (U+00AA) is the crux of the problem: it shows that Python treated the two UTF-8 bytes of ê as two separate characters of an 8-bit character set.
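The effect can be reproduced without touching the locale. The sketch below assumes the terminal sent UTF-8 bytes and that they were decoded as ISO-8859-1 (`latin-1`), which matches the setup described above; the variable names are only illustrative.

```python
# u'tempête' written with an escape so the source file encoding does not matter
correct = u'temp\xeate'                    # ê is the single code point U+00EA

# A UTF-8 terminal sends ê as the two bytes 0xC3 0xAA
utf8_bytes = correct.encode('utf-8')

# Decoding those bytes as an 8-bit charset turns each byte into its own character
misread = utf8_bytes.decode('latin-1')
assert misread == u'temp\xc3\xaate'        # the value shown in the question

# Reversing the wrong decode recovers the intended text
repaired = misread.encode('latin-1').decode('utf-8')
assert repaired == correct
```

The round trip in the last step only works because ISO-8859-1 maps every byte to a code point; it is a diagnostic aid here, not a substitute for fixing the locale.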

I can't replicate your final print, but I suspect that fiddling with setdefaultencoding() has cooked everything. See my answer about why it's a bad idea: https://stackoverflow.com/a/34378962/1554386

answered Feb 24 '16 at 23:16

Alastair McCormack


Thank you for your answer and the one you linked to, very interesting! But I still have a problem, and I'm not sure if it's related after all... I'm scraping a web page with charset `iso-8859-1`, and scrapy returns a utf-8 unicode response. The text to scrape is 'tempête'. I've put a `pdb.set_trace()` statement to check the response and it's correct: `u'temp\xc3\xaate'`, but when I tried to `print` it, the same problem as above occurs: `print(response) -> tempÃªte`. I'm trying to put this response in a json file, and that's also what is written to the json file instead of 'tempête'. Any clue on this? – Pierre Criulanscy Feb 25 '16 at 09:19
Ok, this is a typical [XY Problem](http://meta.stackexchange.com/questions/66377/what-is-the-xy-problem). Too many assumptions have been made when the problem is probably quite simple in the original code. Please update the question with the actual scraping code instead. – Alastair McCormack Feb 25 '16 at 10:14
Maybe I should create a whole new question to keep the comments and answer in sync with the old subject? – Pierre Criulanscy Feb 26 '16 at 08:36
Good idea. Create a new question and paste a link here so I can help you again. – Alastair McCormack Feb 26 '16 at 08:38
Here is the new question: http://stackoverflow.com/questions/35648216/python-scrapy-bad-utf8-characters-writed-in-file-from-scraped-html-page-with – Pierre Criulanscy Feb 26 '16 at 09:44
