1

From web scraping with BeautifulSoup, I get a query string parameter that ends up being represented as:

param_value = u'\xc3\xa9cosyst\xc3\xa8mes'

When reading it, I can guess that it should be represented as écosystèmes.

I tried several ways to encode / escape / decode (as described here and here).

But I keep on getting errors like:

UnicodeEncodeError('ascii', u'\xc3\xa9cosyst\xc3\xa8mes', 0, 2, 'ordinal not in range(128)')

I also tried the solution proposed as duplicate:

Python 2.7.15 (default, Jul 23 2018, 21:27:06)
[GCC 4.2.1 Compatible Apple LLVM 9.1.0 (clang-902.0.39.2)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> s = u'\xc3\xa9cosyst\xc3\xa8mes'
>>> s.encode('latin-1').decode('utf-8')
u'\xe9cosyst\xe8mes'

but it gets me back to square 1...

How can I get from u'\xc3\xa9cosyst\xc3\xa8mes' to u'écosystèmes'?

E. Jaep
  • 1
    Related: [Fixing mojibakes in UTF-8 text](https://stackoverflow.com/questions/48430825/fixing-mojibakes-in-utf-8-text). What you have looks like UTF-8 decoded as latin-1. – Ilja Everilä Mar 24 '19 at 11:32
  • `u'\xe9cosyst\xe8mes'` is the correct unicode string value. You should now read [Understanding repr( ) function in Python](https://stackoverflow.com/questions/7784148/understanding-repr-function-in-python) – Ilja Everilä Mar 24 '19 at 13:45

2 Answers

1

You have UTF-8 decoded as latin-1, so the solution is to encode as latin-1 then decode as UTF-8.

>>> s = u'\xc3\xa9cosyst\xc3\xa8mes'
>>> s.encode('latin-1').decode('utf-8')
u'\xe9cosyst\xe8mes'
>>> print s.encode('latin-1').decode('utf-8')
écosystèmes
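
If the same repair has to be applied to many scraped values, it could be wrapped in a small helper. This is only a sketch: the name `fix_mojibake` is made up for illustration, and it assumes the text was damaged exactly as above (UTF-8 bytes decoded as latin-1). Values that do not fit that pattern are returned unchanged.

```python
def fix_mojibake(text):
    """Undo a UTF-8-read-as-latin-1 mix-up; return text unchanged otherwise."""
    try:
        return text.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters outside latin-1, or the resulting
        # bytes are not valid UTF-8, so it was probably not garbled this way.
        return text


garbled = u'\xc3\xa9cosyst\xc3\xa8mes'
fixed = fix_mojibake(garbled)
print(fixed)  # needs a terminal that can display the accented characters
```

Calling the helper on an already correct string leaves it alone, because the latin-1 bytes of the repaired text are not valid UTF-8.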
snakecharmerb
  • It only gets me back to square 1... ``` >>> s = u'\xc3\xa9cosyst\xc3\xa8mes' >>> s.encode('latin-1') '\xc3\xa9cosyst\xc3\xa8mes' >>> s.encode('latin-1').decode('utf-8') u'\xe9cosyst\xe8mes' ``` – E. Jaep Mar 24 '19 at 12:04
  • 2
    That's not square one - that's the solution. The `repr` doesn't necessarily show the decoded text; but try to `print` it (on a device which can handle Unicode). – tripleee Mar 24 '19 at 12:11
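
A quick way to check the point about `repr` is to inspect the repaired string directly instead of looking at how the interpreter echoes it. The snippet below is a small sketch using the standard `unicodedata` module; the variable name `fixed` is just for illustration.

```python
import unicodedata

garbled = u'\xc3\xa9cosyst\xc3\xa8mes'
fixed = garbled.encode('latin-1').decode('utf-8')

# The repaired string is two characters shorter: each accented letter was
# previously spread over two latin-1 characters.
print(len(garbled), len(fixed))

# \xe9 and \xe8 are escapes for single code points, not leftover bytes.
print(unicodedata.name(fixed[0]))
print(unicodedata.name(fixed[7]))
```

The escape form `u'\xe9cosyst\xe8mes'` shown at the Python 2 prompt is only how `repr` writes those code points; the value itself is the expected word.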
0

I think this will help (Python 3 syntax): `bytes(u'\xc3\xa9cosyst\xc3\xa8mes', 'latin-1').decode('utf-8')`
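
For reference, here is a short sketch of how that expression relates to the `encode`/`decode` round trip. It assumes Python 3, where `bytes(text, encoding)` is available; on Python 2.7, as used in the question, `bytes` is an alias of `str` and, as far as I know, does not accept an encoding argument, so `s.encode('latin-1')` is the portable spelling.

```python
s = u'\xc3\xa9cosyst\xc3\xa8mes'

# Python 3 only: build the byte string from the text and an encoding.
raw = bytes(s, 'latin-1')

# Same bytes as the encode() call, which also works on Python 2.7.
assert raw == s.encode('latin-1')

fixed = raw.decode('utf-8')
print(fixed)
```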

user38
  • It will, trivially, if you can figure out how to convert the `u''` string in the question into this `b''` bytestring; but that is obviously the nontrivial crux of this question. – tripleee Mar 24 '19 at 11:44
  • Like so: `bytes(u'\xc3\xa9cosyst\xc3\xa8mes', 'latin-1').decode('utf-8')` This should work now – user38 Mar 24 '19 at 11:55