Special HTML characters in Python to ASCII

Question

I want to convert special characters that I see when reading a web page to ASCII. I've tried a lot, but I can't figure it out. Below are some examples, which are stored in a Python string. I don't know what the current encoding of the web page is, but I want to convert the text to ASCII.

ApaydÄ±n Ãœnal > want this to Apaydin Unal
Íñigo Martínez > want this to Inigo Martinez
ÃœstÃ¼nel > want this to Ustunel

Who can help me?

EDIT: Thanks, I forgot. I'm using Python 2.7

What Python version are you using? Assuming Python 2, `MyString.encode('iso-8859-1')`. Encodings depend heavily on the console or output you're using and on the version of Python, and depending on the format the data arrives in, you convert it a little differently. — Torxed, Mar 24 '14 at 08:50
This results in an error when trying to convert the string 'SÃ¶nmez': UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 7: ordinal not in range(128). — Coryza, Mar 24 '14 at 08:52
`x.decode('utf-8').encode('iso-8859-1', errors='replace')` will decode the byte string into a unicode object and then re-encode it into a character encoding your output understands. In my console this produces `'\xc3?st\xc3\xbcnel > want this to Ustunel'`, most likely because I'm not using **iso-8859-1** but UTF-8. It would also help to mention where you are reading this output: web browser, console (if so, which OS), or text file? — Torxed, Mar 24 '14 at 08:55
I'm reading this from m.facebook.com website. This results in strings like Y?ld?z and K??lal?, which is not the desired result. — Coryza, Mar 24 '14 at 09:00
When reading data from Facebook, you are most likely not using a UTF-8 connection, which makes your data look like this. — ek9, Mar 24 '14 at 09:01
@Coryza It would be nice to see the code you're using to retrieve the Facebook data. First, you should be using their API. Second, if you're using sockets, you need to determine the encoding manually. One easy way to do it is to check the headers of the GET response and also the block.. (can't remember it off the top of my head). — Torxed, Mar 24 '14 at 09:57

score 1 · Accepted Answer · edited May 23 '17 at 12:23

Give https://pypi.python.org/pypi/Unidecode a try:

>>> from unidecode import unidecode
>>> unidecode(u'ko\u017eu\u0161\u010dek')
'kozuscek'

And to detect the encoding, see the question "Determine the encoding of text in Python".
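If installing a package is not an option, a rough standard-library sketch for the strings in the question could look like the one below. It rests on a guess that the question does not confirm: that text such as `ÃœstÃ¼nel` is UTF-8 that was mistakenly decoded as Windows-1252. The names `fix_mojibake`, `to_ascii`, and `EXTRA` are hypothetical helpers made up for this illustration, not part of any library.

```python
# -*- coding: utf-8 -*-
import unicodedata

# Letters without a Unicode decomposition need an explicit mapping.
# Illustrative only, far from exhaustive.
EXTRA = {0x0131: u'i', 0x00DF: u'ss', 0x00F8: u'o', 0x00D8: u'O'}


def fix_mojibake(text):
    """Try to undo UTF-8 that was wrongly decoded as Windows-1252.

    Assumption: the damage is exactly one wrong cp1252 decode.
    If the round trip fails, the text is returned unchanged.
    """
    try:
        return text.encode('cp1252').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


def to_ascii(text):
    """Repair the text if possible, then drop accents to get plain ASCII."""
    text = fix_mojibake(text).translate(EXTRA)
    # NFKD splits accented letters into base letter + combining mark;
    # encoding with 'ignore' then discards the combining marks.
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')


# The three examples from the question, written with escapes so the
# source file encoding does not matter.
print(to_ascii(u'Apayd\u00c4\u00b1n \u00c3\u0153nal'))  # Apaydin Unal
print(to_ascii(u'\u00cd\u00f1igo Mart\u00ednez'))       # Inigo Martinez
print(to_ascii(u'\u00c3\u0153st\u00c3\u00bcnel'))       # Ustunel
```

The repair step is only a heuristic: any character that neither decomposes nor appears in `EXTRA` is silently dropped, so Unidecode remains the more complete choice for transliteration.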

answered Mar 24 '14 at 09:04

Marco Mariani

