I have a string from bs4 that is

s = "vinili-disponibili/311-canzoniere-del-lazio-lassa-st\u00c3\u00a0-la-me-creatura.html"

\u00c3\u00a0 should be an a with a grave accent (à). I have gotten it to show up in the console partly correct as

vinili-disponibili/311-canzoniere-del-lazio-lassa-stÃ -la-me-creatura.html

with

str2 = u'%s' % s
print(str2.encode('utf-8').decode('unicode-escape'))

but it's decoding c3 and a0 separately, so I get an A with a tilde (Ã) instead of an a with a grave accent (à). I know that c3 a0 is the hex UTF-8 encoding of à. I got this far by combining answers I found through Google, and I don't understand what is actually going on. This entire character encoding thing seems like a big mess to me.

The expected result is

311-canzoniere-del-lazio-lassa-stà-la-me-creatura.html

EDIT: Andrey's method worked when printing the string, but when I pass the string to urlopen I get UnicodeEncodeError: 'ascii' codec can't encode character '\xe0' in position 60: ordinal not in range(128)

After using unquote(str,":/") it gives UnicodeEncodeError: 'ascii' codec can't encode characters in position 56-57: ordinal not in range(128).

  • Something's fishy here. How did you even get a string like that? I'm pretty sure you already did something wrong, and fixing the problem upstream is better than retroactively cleaning up the mess. – Aran-Fey Oct 17 '18 at 07:46
  • What encoding was the original data created in? `ISO-8859-1`? It should be declared at the top of your HTML file. – Torxed Oct 17 '18 at 07:47
  • Possible duplicate of [Convert Unicode Escape to Hebrew text](https://stackoverflow.com/questions/52457095/convert-unicode-escape-to-hebrew-text) – Andrey Tyukin Oct 17 '18 at 10:25
  • `input.encode('latin-1').decode('raw_unicode_escape').encode('latin-1').decode('utf-8')` works, seems to be the same as in the proposed duplicate. @Aran-Fey Fixing it could be difficult if the broken upstream source is Facebook itself. – Andrey Tyukin Oct 17 '18 at 10:27
  • It came from a bs4 scrape of a website. The site calls a backend PHP script that outputs an HTML body, but with every '\' written as '\\' and a bunch of '\\t' and '\\n'. I just run a regex over the entire thing to turn '\\' into '\', but I have no control over their encoding. I can't get the encoding from the head because the script only produces the HTML body. @AndreyTyukin's suggestion works for my data though. – I should change my Username Oct 20 '18 at 00:05

2 Answers

Assuming Python 2:

This is a byte string containing Unicode escapes. The escapes were generated incorrectly from UTF-8-encoded data: each UTF-8 byte was escaped as if it were a separate code point:

>>> s = "vinili-disponibili/311-canzoniere-del-lazio-lassa-st\u00c3\u00a0-la-me-creatura.html"
>>> s.decode('unicode-escape')
u'vinili-disponibili/311-canzoniere-del-lazio-lassa-st\xc3\xa0-la-me-creatura.html'

Now it is a Unicode string, but it is mis-decoded: its code points are really the byte values of UTF-8 data. It turns out that the latin1 (also known as iso-8859-1) codec maps the first 256 code points directly to bytes 0-255, so use this trick to convert back to a byte string:

>>> s.decode('unicode-escape').encode('latin1')
'vinili-disponibili/311-canzoniere-del-lazio-lassa-st\xc3\xa0-la-me-creatura.html'

Now it can be decoded correctly as UTF-8:

>>> s.decode('unicode-escape').encode('latin1').decode('utf8')
u'vinili-disponibili/311-canzoniere-del-lazio-lassa-st\xe0-la-me-creatura.html'

The result is a Unicode string, so the interactive prompt displays its repr() value, which shows code points above U+007F as escape codes. Use print to see the actual value, assuming your terminal is correctly configured with an encoding that supports the printed characters:

>>> print(s.decode('unicode-escape').encode('latin1').decode('utf8'))
vinili-disponibili/311-canzoniere-del-lazio-lassa-stà-la-me-creatura.html
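
The latin1 trick used above relies on that codec being a one-to-one mapping between code points 0-255 and byte values 0-255. A minimal sketch to check this, written for Python 3 rather than the Python 2 session above (the variable names are only illustrative):

```python
# Every possible byte value, 0x00 through 0xFF.
all_bytes = bytes(range(256))

# Decoding as latin-1 turns byte N into code point N ...
text = all_bytes.decode("latin-1")
assert [ord(ch) for ch in text] == list(range(256))

# ... and encoding turns code point N back into byte N, so nothing is lost.
assert text.encode("latin-1") == all_bytes

# The two mis-decoded code points from the question become the UTF-8 bytes of U+00E0.
assert "\u00c3\u00a0".encode("latin-1") == b"\xc3\xa0"
assert b"\xc3\xa0".decode("utf-8") == "\u00e0"
```

This is why encoding with latin1 recovers the original bytes exactly, whereas most other codecs would raise an error or change some values.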

Ideally, fix the problem that generated this string incorrectly in the first place instead of working around the mess.

Mark Tolonen

Transform the string back into bytes using .encode('latin-1'), then decode the \u Unicode escapes with 'raw_unicode_escape', transform everything into bytes again using the "wrong" 'latin-1' encoding, and finally decode "properly" as 'utf-8':

s = "vinili-disponibili/311-canzoniere-del-lazio-lassa-st\u00c3\u00a0-la-me-creatura.html"
s.encode('latin-1').decode('raw_unicode_escape').encode('latin-1').decode('utf-8')

gives:

'vinili-disponibili/311-canzoniere-del-lazio-lassa-stà-la-me-creatura.html'
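
To see what each link in that chain does, here is a step-by-step sketch in Python 3. It assumes the scraped text contains literal backslash-u sequences (as the comments on the question suggest), which is why the sample uses a raw string; the shortened sample and the step names are only illustrative.

```python
# Shortened sample; the raw string keeps \u00c3\u00a0 as literal backslash sequences.
s = r"lassa-st\u00c3\u00a0-la-me-creatura.html"

# 1. Text -> bytes. The input is pure ASCII here, so latin-1 changes nothing.
step1 = s.encode("latin-1")

# 2. Interpret the \uXXXX escapes. This yields the two code points U+00C3 and U+00A0.
step2 = step1.decode("raw_unicode_escape")

# 3. Map those code points back to the byte values 0xC3 and 0xA0.
step3 = step2.encode("latin-1")

# 4. Decode the bytes as what they really are: UTF-8.
step4 = step3.decode("utf-8")

print(step4)
```

If the string already holds the real characters U+00C3 and U+00A0 instead of escape text, step 2 finds nothing to unescape and the chain still yields the same result.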

It works for the same reason as explained in this answer.
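
Regarding the UnicodeEncodeError from urlopen mentioned in the question's edit: a URL sent over HTTP has to be ASCII, so a likely fix is to percent-encode the repaired path with urllib.parse.quote before building the URL. This is a sketch under assumptions: the base URL is a placeholder, not the real site, and the request itself is left out.

```python
from urllib.parse import quote

# Hypothetical base URL; substitute the site actually being scraped.
base = "http://example.com/"

# The repaired path, with a real U+00E0 character in it.
path = "vinili-disponibili/311-canzoniere-del-lazio-lassa-st\u00e0-la-me-creatura.html"

# quote() encodes the text as UTF-8 and percent-escapes the non-ASCII bytes;
# safe="/" keeps the path separators unescaped.
quoted_path = quote(path, safe="/")
url = base + quoted_path

print(url)  # the path now ends in ...lassa-st%C3%A0-la-me-creatura.html
```

The resulting url is plain ASCII and can be passed to urllib.request.urlopen. Note that quote goes in the opposite direction from unquote, which turns %XX sequences back into characters.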

Andrey Tyukin