I'm retrieving data from the internet and I want to convert it to ASCII, but I can't get it to work. (Python 2.7)

When I use decode('utf-8') on the strings I get, for example, Yalçınkaya. However, I want this converted to Yalcinkaya. The raw data was YalÃ§Ä±nkaya.

Anyone who can help me?

Thanks.

Edit: I've tried the suggestion that was made by the user who marked this question as duplicate (What is the best way to remove accents in a Python unicode string?) but that did not solve my problem.

That post mainly talks about removing the special characters, which did not solve my problem of replacing the Turkish characters (Yalçınkaya) with their ASCII equivalents (Yalcinkaya).

import HTMLParser
import unicodedata

# Printing the raw string in Python results in "Yalçınkaya".
# Converting it to unicode from UTF-8 changes the string to 'Yalçınkaya'.
# HTMLParser is used to revert escaped special characters such as commas.
# NFKD normalization is used, which converts the string to 'Yalçınkaya'.
# Applying ASCII encoding results in 'Yalcnkaya', missing the original Turkish 'i', which is not what I wanted.
name = unicodedata.normalize('NFKD', unicode(name, 'utf8'))
name = HTMLParser.HTMLParser().unescape(name)
name = unicodedata.normalize('NFKD', u'%s' % name).encode('ascii', 'ignore')
  • Edit: I've tried the suggestion that was made by the user who marked this question as duplicate (What is the best way to remove accents in a python unicode string?) but that did not solve my problem. – Johan Eriksen Jul 19 '14 at 13:41
  • I've re-opened the question, perhaps you could state exactly why that didn't solve your problem so you get suitable suggestions... – Jon Clements Jul 19 '14 at 13:42
  • Thank you for your edit. One more thing though: could you provide a [short, self-contained, correct example](http://sscce.org/) demonstrating your issue, so that we can reproduce the problem? – Sylvain Leroux Jul 19 '14 at 13:56
  • Did you try the `unidecode` library mentioned in the accepted answer to that question? It turns `'Yalçınkaya'` into `'Yalcinkaya'` for me. – DSM Jul 19 '14 at 14:00
  • You are indeed correct. Not sure how I missed that, but it works! Thanks a lot. (How can I accept your answer?) – Johan Eriksen Jul 19 '14 at 14:08
  • Since the answer I'd write would be the same and the question is close enough, I'm closing as a dup of the original target. – DSM Jul 19 '14 at 14:24

1 Answer

Let's check. First, one really needs to understand what character encodings and Unicode are. That is dead serious. I'd suggest you read http://www.joelonsoftware.com/articles/Unicode.html before continuing any further in your project. (By the way, "converting to ASCII" is not a generally useful solution; it's more like breakage. Think about trying to parse numbers, but since you don't understand the digit "9", you decide just to skip it.)

That said, you can tell Python to "decode" a string and replace the bytes that are not valid in the chosen encoding with a proper "unknown" character (u"\ufffd"); you can then replace that character before re-encoding to your preferred output: `raw_data.decode("ASCII", errors="replace")`. If you choose to break your parsing even further, you can use "ignore" instead of "replace": the unknown characters will just be suppressed. Remember that you get a "Unicode" object after decoding, so you have to apply the "encode" method to it before outputting that data anywhere (printing, writing to a file, etc.). Please read the article linked above.
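
To make the two error handlers concrete, here is a minimal sketch. The byte string is an assumption: it is simply "Yalçınkaya" encoded as UTF-8, standing in for whatever your download returns, and the `?` placeholder is an arbitrary choice.

```python
# Assumed sample input: "Yalçınkaya" as UTF-8 bytes, as it might arrive from the network.
raw_data = b"Yal\xc3\xa7\xc4\xb1nkaya"

# "replace": every byte outside the ASCII range becomes U+FFFD.
replaced = raw_data.decode("ascii", errors="replace")

# "ignore": the same bytes are silently dropped instead.
ignored = raw_data.decode("ascii", errors="ignore")

# Swap the marker for a visible placeholder, then encode again before output.
output = replaced.replace(u"\ufffd", u"?").encode("ascii")
```

Note that both variants lose the "ç" and the "ı" entirely: four bytes are replaced or dropped, and nothing is transliterated to "c" or "i".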

Now, checking your specific data: the particular string in your raw data is exactly UTF-8 text that looks as though it had been decoded as latin-1 (which displays Yalçınkaya as YalÃ§Ä±nkaya). Just decode it from UTF-8 as usual, and then use the recipe above to strip the accents. Be advised, though, that this only works for Latin letters with diacritics, and "world text" from the Internet may contain all kinds of characters, so you should not rely on everything being convertible to ASCII. I have to say again: read that article, and rethink your practices.
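
As a rough illustration of that limitation, the sketch below first undoes the latin-1 mix-up and then strips accents via NFKD normalization, the approach already tried in the question. The `garbled` value is an assumed reconstruction of the mis-decoded text, and the explicit replacement of the dotless "ı" is a hypothetical workaround added for illustration, not part of any library.

```python
import unicodedata

# Assumed sample: UTF-8 bytes that were mistakenly decoded as Latin-1.
garbled = u"Yal\u00c3\u00a7\u00c4\u00b1nkaya"

# Undo the wrong step to recover the bytes, then decode them as UTF-8.
name = garbled.encode("latin-1").decode("utf-8")

# NFKD splits "ç" into "c" plus a combining cedilla; encoding with "ignore"
# then drops the combining mark. The dotless "ı" (U+0131) has no
# decomposition, so it is dropped completely.
stripped = unicodedata.normalize("NFKD", name).encode("ascii", "ignore")

# Hypothetical workaround: map the dotless "ı" to "i" by hand before stripping.
patched = name.replace(u"\u0131", u"i")
fixed = unicodedata.normalize("NFKD", patched).encode("ascii", "ignore")
```

Here `stripped` loses the "ı" just as described in the question, while `fixed` keeps an "i" in its place. Every character without a decomposition needs its own mapping, which is why a transliteration table such as the `unidecode` library mentioned in the comments scales better.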

Peter O.
jsbueno