sys.getdefaultencoding()
-> utf8
test = u'tempête'
test
-> u'temp\xc3\xaate'
print(test)
-> tempête # WTF ?

sys.setdefaultencoding('ascii')
sys.getdefaultencoding()
-> ascii
test = u'tempête'
test
-> u'temp\xc3\xaate'
print(test)
-> tempête #...

I observe these results when I do a set_trace() from pdb.

In a Python 2.7 shell I get the correct result:

sys.getdefaultencoding()
-> ascii
test = u'tempête'
test
-> u'temp\xc3\xaate'
print(test)
-> tempête # WTF ?

I've been struggling with this for hours...

Alastair McCormack
Pierre Criulanscy
  • I cannot recreate the problem with python 2.7 – meltdown90 Feb 24 '16 at 16:10
  • In python2.7 shell, I get `AttributeError: 'NoneType' object has no attribute 'CodecInfo'` – zondo Feb 24 '16 at 16:16
  • Please show us the shell output directly. – thebjorn Feb 24 '16 at 19:32
  • Python 2.x should never have a default encoding of UTF-8. You would've had to `reload(sys)` to make this work, which should tell you that it's not supposed to be played with. – Alastair McCormack Feb 25 '16 at 08:59
  • I'm voting to close this question as off-topic because the OP has created a more accurate representation of the problem here: http://stackoverflow.com/questions/35648216/python-scrapy-bad-utf8-characters-writed-in-file-from-scraped-html-page-with – Alastair McCormack Feb 26 '16 at 09:55

1 Answer

Ensure your locale encoding matches your terminal emulation. Run the locale command in your shell to check.

sys.setdefaultencoding() has nothing to do with printing - Python uses your locale to set the stdout encoding used when printing. See sys.stdout.encoding.
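To see which encodings are actually in play, something like the following can be run in the same environment where the bad output appears. It is only a sketch: the variable names are mine, and `sys.stdout.encoding` may be `None` when output is piped or redirected rather than sent to a terminal.

```python
import locale
import sys

# Encoding Python will use when printing to stdout.
# getattr() guards against replaced stdout objects that lack the attribute.
stdout_encoding = getattr(sys.stdout, "encoding", None)

# Encoding suggested by the locale settings (LANG, LC_ALL, ...).
preferred_encoding = locale.getpreferredencoding()

# Default used for implicit str/unicode conversions; it does not drive print.
default_encoding = sys.getdefaultencoding()

print("stdout encoding:    %s" % stdout_encoding)
print("locale preferred:   %s" % preferred_encoding)
print("default encoding:   %s" % default_encoding)
```

If the first two values disagree with what your terminal emulator is set to, that mismatch is the first thing to fix.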

I can partially replicate your problem like this:

  1. Set terminal emulation to: UTF-8
  2. Set locale to en_GB.ISO8859-1, i.e. not UTF-8:

    export LANG=en_GB.ISO8859-1
    
  3. Run your code:

    >>> test = u'tempête'
    >>> test
    u'temp\xc3\xaate'
    

The fact that ê becomes Ã (U+00C3) and ª (U+00AA) is the crux of the problem: it shows that Python thought the input was encoded in an 8-bit character set, so each of the two UTF-8 bytes of ê was decoded as a separate character.
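The same effect can be reproduced without touching the terminal or locale. This is a minimal sketch with variable names of my own choosing; it decodes the UTF-8 bytes as ISO-8859-1 on purpose and then reverses the mistake. The `u''` literals with escapes keep it independent of the source file encoding.

```python
# u'tempête', written with an escape for ê (U+00EA)
original = u'temp\xeate'

# UTF-8 encodes ê as the two bytes 0xC3 0xAA
utf8_bytes = original.encode('utf-8')

# Decoding those bytes with an 8-bit charset turns each byte into one character
misdecoded = utf8_bytes.decode('iso-8859-1')

# Reversing the wrong decode recovers the intended text
repaired = misdecoded.encode('iso-8859-1').decode('utf-8')

print(repr(misdecoded))
print(repaired == original)
```

Here `misdecoded` is the same value as the `u'temp\xc3\xaate'` in the question, one character longer than the intended string. The round trip in `repaired` is only a diagnostic aid; the real fix is to decode with the right encoding in the first place.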

I can't replicate your final print, but I suspect that fiddling with setdefaultencoding() has cooked everything. See my answer about why it's a bad idea: https://stackoverflow.com/a/34378962/1554386

Alastair McCormack
  • Thank you for your answer and the one you linked to, very interesting! But I still have a problem, and I'm not sure whether it's related after all... I'm scraping a web page with charset `iso-8859-1`, and scrapy returns a UTF-8 unicode response. The text to scrape is 'tempête'. I've put a `pdb.set_trace()` statement to check the response and it's correct: `u'temp\xc3\xaate'`, but when I try to `print` it, the same problem as above occurs: `print(response) -> tempête`. I'm trying to put this response in a JSON file, and that's also what is written to the JSON file instead of 'tempête'. Any clue on this? – Pierre Criulanscy Feb 25 '16 at 09:19
  • Ok, this is a typical [XY Problem](http://meta.stackexchange.com/questions/66377/what-is-the-xy-problem). Too many assumptions have been made when the problem is probably quite simple in the original code. Please update the question with the actual scraping code instead. – Alastair McCormack Feb 25 '16 at 10:14
  • Maybe I should create a whole new question, so the comments and answer here stay in sync with the original subject? – Pierre Criulanscy Feb 26 '16 at 08:36
  • Good idea. Create a new question and paste a link here so I can help you again. – Alastair McCormack Feb 26 '16 at 08:38
  • Here is the new question: http://stackoverflow.com/questions/35648216/python-scrapy-bad-utf8-characters-writed-in-file-from-scraped-html-page-with – Pierre Criulanscy Feb 26 '16 at 09:44