I want to convert the special characters I see when reading a web page to ASCII. I've tried a lot, but I can't figure it out. Below are some examples, which are stored in a string in Python. I don't know what the current encoding of the web page is, but I want to convert the text to ASCII.

Apaydın Ünal > want this to Apaydin Unal
Íñigo Martínez > want this to Inigo Martinez
Üstünel > want this to Ustunel

Who can help me?

EDIT: Thanks, I forgot. I'm using Python 2.7

Coryza
  • What Python version are you using? Assuming Python 2, `MyString.encode('iso-8859-1')`. Encodings depend heavily on the console or output you're using and on the version of Python, and depending on the format the data arrives in, you convert it a little differently. – Torxed Mar 24 '14 at 08:50
  • This results in an error when trying to convert the string 'Sönmez': UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 7: ordinal not in range(128). – Coryza Mar 24 '14 at 08:52
  • `x.decode('utf-8').encode('iso-8859-1', errors='replace')` will decode the byte string into a Unicode object and then re-encode it into a character encoding your output understands. In my console this produces `'\xc3?st\xc3\xbcnel > want this to Ustunel'`, most likely because I'm not using **iso-8859-1**; I'm actually using UTF-8. Also, mentioning where you read this output would be nice: web browser, console (if so, which OS), text file? – Torxed Mar 24 '14 at 08:55
  • I'm reading this from the m.facebook.com website. This results in strings like Y?ld?z and K??lal?, which is not the desired result. – Coryza Mar 24 '14 at 09:00
  • When reading data from Facebook, you are most likely not using a UTF-8 connection, which makes your data look like this. – ek9 Mar 24 '14 at 09:01
  • @Coryza Would be nice to see the code you're using to retrieve the Facebook data. First off, you should be using their API. Secondly, if you're using sockets, you need to determine the encoding manually; one easy way to do it is to check the headers from the GET data and also the block.. (can't remember off the top of my head.) – Torxed Mar 24 '14 at 09:57

1 Answer

Give https://pypi.python.org/pypi/Unidecode a try:

>>> from unidecode import unidecode
>>> unidecode(u'ko\u017eu\u0161\u010dek')
'kozuscek'
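
Note that the strings in the question appear to be byte strings (the `0xc3` byte in the reported `UnicodeDecodeError` suggests UTF-8), so they need to be decoded before any transliteration. The sketch below shows that decode step with a standard-library approximation of the transliteration. It is not how Unidecode works internally, and Unidecode covers far more characters. The names `to_ascii` and `EXTRA` are made up for illustration, and UTF-8 input is an assumption.

```python
import unicodedata

# Hypothetical fallback table: letters with no Unicode decomposition,
# which NFKD alone would leave untouched (and 'ignore' would then drop).
EXTRA = {
    u'\u0131': u'i',   # Turkish dotless i
    u'\u00f8': u'o',   # o with stroke
    u'\u00df': u'ss',  # sharp s
}

def to_ascii(raw, encoding='utf-8'):
    """Decode a byte string (assumed UTF-8) and strip accents."""
    # Byte strings must be decoded first; Unicode input passes through.
    text = raw.decode(encoding) if isinstance(raw, bytes) else raw
    # Replace letters that cannot be decomposed.
    text = u''.join(EXTRA.get(ch, ch) for ch in text)
    # Split accented letters into base letter + combining mark.
    decomposed = unicodedata.normalize('NFKD', text)
    # Drop the combining marks, keeping the base letters.
    stripped = u''.join(ch for ch in decomposed
                        if not unicodedata.combining(ch))
    # Anything still outside ASCII is silently discarded.
    return stripped.encode('ascii', 'ignore').decode('ascii')

print(to_ascii(b'Apayd\xc4\xb1n \xc3\x9cnal'))            # Apaydin Unal
print(to_ascii(b'\xc3\x8d\xc3\xb1igo Mart\xc3\xadnez'))   # Inigo Martinez
print(to_ascii(b'\xc3\x9cst\xc3\xbcnel'))                 # Ustunel
```

With Unidecode installed, the equivalent call would be `unidecode(raw.decode('utf-8'))`, which needs no fallback table.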

And to detect the encoding, see the question Determine the encoding of text in Python
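
As a rough illustration of what detecting the encoding can look like without extra libraries, here is a minimal sketch that first honours a `charset=` label from the HTTP `Content-Type` header, if one is given, and then tries a short list of fallback encodings. The function name `decode_page`, the fallback list, and the choice of `cp1254` (a Turkish code page) are assumptions made for this example, not something taken from the linked question.

```python
import re

def decode_page(raw, content_type=None, fallbacks=('utf-8', 'cp1254')):
    """Return (text, encoding_used) for a byte string read from a web page."""
    candidates = []
    if content_type:
        # e.g. 'text/html; charset=utf-8'
        match = re.search(r'charset=["\']?([\w.:-]+)', content_type, re.I)
        if match:
            candidates.append(match.group(1))
    candidates.extend(fallbacks)
    for name in candidates:
        try:
            return raw.decode(name), name
        except (UnicodeDecodeError, LookupError):
            # Wrong guess or unknown codec name: try the next candidate.
            continue
    # latin-1 maps every byte to a code point, so this cannot fail,
    # but the resulting text may be wrong.
    return raw.decode('latin-1'), 'latin-1'

text, used = decode_page(b'\xc3\x9cst\xc3\xbcnel', 'text/html; charset=utf-8')
print(used)  # utf-8
```

Trying encodings in order is only a heuristic: a wrong single-byte encoding can decode without errors and still give the wrong characters, so a label from the server is preferable whenever it is available.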

Marco Mariani