How to decode u'\xc3\xa9cosyst\xc3\xa8mes' to utf-8

Question

From webscraping with BeautifulSoup, I get a query string parameter that ends up being represented as:

param_value = u'\xc3\xa9cosyst\xc3\xa8mes'

When reading it, I can guess that it should be represented as écosystèmes.

I tried several ways to encode / escape / decode (as described here and here).

But I keep on getting errors like:

UnicodeEncodeError('ascii', u'\xc3\xa9cosyst\xc3\xa8mes', 0, 2, 'ordinal not in range(128)')

I also tried the solution proposed in the duplicate:

Python 2.7.15 (default, Jul 23 2018, 21:27:06)
[GCC 4.2.1 Compatible Apple LLVM 9.1.0 (clang-902.0.39.2)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> s = u'\xc3\xa9cosyst\xc3\xa8mes'
>>> s.encode('latin-1').decode('utf-8')
u'\xe9cosyst\xe8mes'

but it gets me back to square 1...

How can I get from u'\xc3\xa9cosyst\xc3\xa8mes' to u'écosystèmes'?

Related: [Fixing mojibakes in UTF-8 text](https://stackoverflow.com/questions/48430825/fixing-mojibakes-in-utf-8-text). What you have looks like UTF-8 decoded as latin-1. — Ilja Everilä, Mar 24 '19 at 11:32
`u'\xe9cosyst\xe8mes'` is the correct unicode string value. You should now read [Understanding repr( ) function in Python](https://stackoverflow.com/questions/7784148/understanding-repr-function-in-python) — Ilja Everilä, Mar 24 '19 at 13:45

score 1 · Accepted Answer · answered Mar 24 '19 at 11:31

You have UTF-8 decoded as latin-1, so the solution is to encode as latin-1 then decode as UTF-8.

>>> s = u'\xc3\xa9cosyst\xc3\xa8mes'
>>> s.encode('latin-1').decode('utf-8')
u'\xe9cosyst\xe8mes'
>>> print s.encode('latin-1').decode('utf-8')
écosystèmes
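
As a minimal sketch of the round trip above, the two steps can be wrapped in a small helper. The name `fix_mojibake` and its fall-back behaviour (returning the input unchanged when the repair does not apply) are assumptions for illustration, not part of the original answer. The code is written for Python 3, where the `u''` prefix is still accepted.

```python
def fix_mojibake(text):
    """Undo a UTF-8-read-as-Latin-1 mis-decoding (hypothetical helper).

    Returns the text unchanged if it does not look like this kind of mojibake.
    """
    try:
        # Latin-1 maps code points 0-255 one-to-one onto bytes,
        # so this recovers the original raw bytes.
        raw = text.encode('latin-1')
        # Those bytes were UTF-8 all along, so decode them as UTF-8.
        return raw.decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either a character above U+00FF, or bytes that are not valid UTF-8.
        return text


fixed = fix_mojibake(u'\xc3\xa9cosyst\xc3\xa8mes')
# Compare against the escaped spelling of the expected text.
print(fixed == u'\xe9cosyst\xe8mes')
```

Applying the helper to a string that is already correct, such as `u'\xe9cosyst\xe8mes'`, leaves it untouched, because the Latin-1 bytes of that string are not valid UTF-8.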

snakecharmerb

It only gets me back to square 1...
```
>>> s = u'\xc3\xa9cosyst\xc3\xa8mes'
>>> s.encode('latin-1')
'\xc3\xa9cosyst\xc3\xa8mes'
>>> s.encode('latin-1').decode('utf-8')
u'\xe9cosyst\xe8mes'
```
– E. Jaep Mar 24 '19 at 12:04
That's not square one - that's the solution. The `repr` doesn't necessarily show the decoded text; but try to `print` it (on a device which can handle Unicode). – tripleee Mar 24 '19 at 12:11
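
To illustrate the point about `repr`, here is a small sketch showing that the escaped form and the accented text are the same string. It assumes Python 3, where `ascii()` produces escapes much like Python 2's `repr()` does for `unicode` objects. The variable names are illustrative only.

```python
import unicodedata

# The repaired string from the answer above.
s = u'\xc3\xa9cosyst\xc3\xa8mes'.encode('latin-1').decode('utf-8')

# ascii() escapes every non-ASCII character; this is only a display form.
escaped = ascii(s)
print(escaped)

# The string itself holds 11 real characters, the first being a single "é".
first_name = unicodedata.name(s[0])
print(len(s), first_name)
```

The escaped output is just another spelling of the same value. On a terminal that can display Unicode, `print(s)` shows the accented text itself.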

user38 · Answer 2 · 2019-03-24T11:55:25.663

0

I think this will help: bytes(u'\xc3\xa9cosyst\xc3\xa8mes', 'latin-1').decode('utf-8')

edited Mar 24 '19 at 11:55

answered Mar 24 '19 at 11:31

user38

It will, trivially, if you can figure out how to convert the `u''` string in the question into this `b''` bytestring; but that is obviously the nontrivial crux of this question. – tripleee Mar 24 '19 at 11:44
Like so: `bytes(u'\xc3\xa9cosyst\xc3\xa8mes', 'latin-1').decode('utf-8')` This should work now – user38 Mar 24 '19 at 11:55
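
A brief sketch of how the `bytes(...)` form relates to the accepted answer, assuming Python 3: there, `bytes(text, 'latin-1')` builds the same byte string as `text.encode('latin-1')`. As far as I know, under Python 2.7 (the version shown in the question) `bytes` is an alias for `str` and does not accept an encoding argument, so the call would raise a `TypeError` there and `.encode('latin-1')` is the form to use.

```python
s = '\xc3\xa9cosyst\xc3\xa8mes'

# Two spellings of the same conversion in Python 3.
via_bytes = bytes(s, 'latin-1')
via_encode = s.encode('latin-1')
print(via_bytes == via_encode)

# Decoding the recovered bytes as UTF-8 yields the intended text.
result = via_bytes.decode('utf-8')
print(ascii(result))
```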
