
I'm trying to download some content from a dictionary site like http://dictionary.reference.com/browse/apple?s=t

The problem I'm having is that the original paragraph has all those squiggly lines, reverse letters, and such, so when I read the local files I end up with funny escape characters like \x85, \xa7, \x8d, etc.

My question is, is there any way I can convert all those escape characters into their respective UTF-8 characters, e.g. if there is an 'à', how do I convert that into a standard 'a'?

Python calling code:

import os
word = 'apple'
os.system(r'wget.lnk --directory-prefix=G:/projects/words/dictionary/urls/ --output-document=G:\projects\words\dictionary\urls/' + word + '-dict.html http://dictionary.reference.com/browse/' + word)

I'm using wget-1.11.4-1 on a Windows 7 system (don't kill me Linux people, it was a client requirement), and the wget exe is being fired off with a Python 2.6 script file.

philipxy
Wolf
  • Show your Python code please. – Raptor Jan 02 '13 at 07:30
  • Converting 'à' to 'a' is not like converting to UTF-8. [UTF-8](http://en.wikipedia.org/wiki/UTF-8) is in fact a text encoding designed to encode characters like 'à' which fall outside the [basic ASCII character set](http://en.wikipedia.org/wiki/ASCII). – Phil Frost Jan 02 '13 at 12:57
  • Possible duplicate of [What is the best way to remove accents in a Python unicode string?](https://stackoverflow.com/questions/517923/what-is-the-best-way-to-remove-accents-in-a-python-unicode-string) – phuclv Apr 04 '18 at 13:49

5 Answers


how do I convert all those escape characters into their respective characters? Like if there is a unicode à, how do I convert that into a standard a?

Assume you have loaded your unicode into a variable called my_unicode... normalizing à into a is this simple...

import unicodedata
output = unicodedata.normalize('NFD', my_unicode).encode('ascii', 'ignore')

Explicit example...

>>> myfoo = u'àà'
>>> myfoo
u'\xe0\xe0'
>>> unicodedata.normalize('NFD', myfoo).encode('ascii', 'ignore')
'aa'
>>>

How it works
unicodedata.normalize('NFD', "insert-unicode-text-here") performs a Canonical Decomposition (NFD) of the unicode text; then we use str.encode('ascii', 'ignore') to transform the NFD-mapped characters into ASCII, ignoring errors.
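To see what the decomposition step actually produces, the intermediate result can be inspected (Python 3 shown here; the transcript above uses Python 2):

```python
import unicodedata

# 'à' decomposes into a base letter plus a combining grave accent
decomposed = unicodedata.normalize('NFD', 'à')
print([unicodedata.name(c) for c in decomposed])
# ['LATIN SMALL LETTER A', 'COMBINING GRAVE ACCENT']

# encode('ascii', 'ignore') then drops the combining mark, leaving the base letter
print(decomposed.encode('ascii', 'ignore'))  # b'a'
```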

Mike Pennington
  • That's great Mike. This may be a bit of a nooby Python question, but is it possible to insert a string and have the `unicodedata.normalize()` function find any unicode escape chars and normalize them? Or do I just have to regex the unicode out and normalize each one? – Wolf Jan 02 '13 at 12:21
  • When you call `unicodedata.normalize()` as I did above, it finds all unicode and normalizes them into ASCII. All you need to do is read the unicode file into a string, call `unicodedata.normalize()` on that string, and save the output to a new filename. – Mike Pennington Jan 02 '13 at 12:48
  • Actually, `unicodedata.normalize()` does not convert the string to ASCII; it performs the canonical decomposition (basically breaking multi-part characters into components); see [docs (Python 3.6)](https://docs.python.org/3.6/library/unicodedata.html#unicodedata.normalize). The `str.encode('ascii', 'ignore')` function converts to ASCII, ignoring errors that would otherwise occur with non-ASCII characters. See docs on [str.encode](https://docs.python.org/3/library/stdtypes.html#str.encode) and [error handlers](https://docs.python.org/3/library/codecs.html#error-handlers). – ASL Jun 19 '17 at 15:15
  • Thank you for correcting my comment above. I took the liberty of editing this information into my answer. – Mike Pennington Jun 29 '17 at 12:33
  • I think that for most use-cases that should be the `NFKD` normalization rather than `NFD`. The former is lossy in Unicode terms, but that doesn't matter here. The ‘compatibility’ part means that a larger set of Unicode characters are mapped to probably-right output. Compare the result of normalizing then ascii-fying `"Dvořák £①"` with `NFKD` and `NFD`: in both cases the ‘£’ sign disappears, but in the former case the ‘①’ ends up as a `1` rather than disappearing. That's _probably_ the desired effect in many cases. – Norman Gray May 25 '22 at 11:50
  • ...and (a parenthetical remark after re-examining other answers here) one almost certainly doesn't want the `NFC` or `NFKC` normalizations, since they decompose and then _re_compose, so that the subsequent ascii-fy step _removes_ any non-ASCII characters rather than converting them to their nearest ASCII equivalent. – Norman Gray May 25 '22 at 11:55
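Norman Gray's NFD-vs-NFKD point can be checked directly (Python 3):

```python
import unicodedata

s = "Dvořák £①"

nfd = unicodedata.normalize('NFD', s).encode('ascii', 'ignore')
nfkd = unicodedata.normalize('NFKD', s).encode('ascii', 'ignore')

# Canonical decomposition only: the circled one vanishes along with the pound sign
print(nfd)   # b'Dvorak '

# Compatibility decomposition additionally maps the circled one to a plain digit
print(nfkd)  # b'Dvorak 1'
```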

@Mike Pennington's solution works great, thanks to him. But when I tried it I noticed that it fails for some special characters (e.g. the ı character from the Turkish alphabet) which have no decomposition defined under NFD.

I discovered another solution: you can use the unidecode library for this conversion.

>>> import unidecode
>>> example = "ABCÇDEFGĞHIİJKLMNOÖPRSŞTUÜVYZabcçdefgğhıijklmnoöprsştuüvyz"

>>> # decode the UTF-8 byte string to a unicode object (Python 2)
>>> utf8text = unicode(example, "utf-8")
>>> print utf8text
ABCÇDEFGĞHIİJKLMNOÖPRSŞTUÜVYZabcçdefgğhıijklmnoöprsştuüvyz

>>> # convert the unicode text to plain ASCII
>>> asciitext = unidecode.unidecode(utf8text)
>>> print asciitext
ABCCDEFGGHIIJKLMNOOPRSSTUUVYZabccdefgghiijklmnooprsstuuvyz
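For reference, the same conversion in Python 3, where source strings are already Unicode so the intermediate unicode() decode step disappears (assumes the unidecode package is installed):

```python
import unidecode

example = "ABCÇDEFGĞHIİJKLMNOÖPRSŞTUÜVYZabcçdefgğhıijklmnoöprsştuüvyz"

# unidecode maps each non-ASCII character to its closest ASCII equivalent,
# including characters like ı that NFD cannot decompose
ascii_text = unidecode.unidecode(example)
print(ascii_text)  # ABCCDEFGGHIIJKLMNOOPRSSTUUVYZabccdefgghiijklmnooprsstuuvyz
```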
wolfiem

I needed something like this, but to remove only accented characters while ignoring special ones, so I wrote this small function:

# -*- coding: utf-8 -*-
import re

def remove_accents(string):
    if type(string) is not unicode:
        string = unicode(string, encoding='utf-8')

    string = re.sub(u"[àáâãäå]", 'a', string)
    string = re.sub(u"[èéêë]", 'e', string)
    string = re.sub(u"[ìíîï]", 'i', string)
    string = re.sub(u"[òóôõö]", 'o', string)
    string = re.sub(u"[ùúûü]", 'u', string)
    string = re.sub(u"[ýÿ]", 'y', string)

    return string

I like this function because you can customize it in case you need to ignore other characters.
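A Python 3 sketch of the same idea; strings are Unicode by default there, so the type check disappears:

```python
import re

def remove_accents(string):
    # each substitution collapses one family of accented vowels onto its base letter
    string = re.sub(u"[àáâãäå]", 'a', string)
    string = re.sub(u"[èéêë]", 'e', string)
    string = re.sub(u"[ìíîï]", 'i', string)
    string = re.sub(u"[òóôõö]", 'o', string)
    string = re.sub(u"[ùúûü]", 'u', string)
    string = re.sub(u"[ýÿ]", 'y', string)
    return string

print(remove_accents("pingüino àéîõü"))  # pinguino aeiou
```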

AlvaroAV
  • SyntaxError: Non-ASCII character '\xc3' in file source.py on line 65, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details – Anoyz Mar 30 '18 at 21:52
  • You need to add this at the beginning of the file: # -*- coding: utf-8 -*- – Anoyz Mar 30 '18 at 21:55

The given URL returns UTF-8 as the HTTP response clearly indicates:

wget -S http://dictionary.reference.com/browse/apple?s=t
--2013-01-02 08:43:40--  http://dictionary.reference.com/browse/apple?s=t
Resolving dictionary.reference.com (dictionary.reference.com)... 23.14.94.26, 23.14.94.11
Connecting to dictionary.reference.com (dictionary.reference.com)|23.14.94.26|:80... connected.
HTTP request sent, awaiting response... 
  HTTP/1.1 200 OK
  Server: Apache
  Cache-Control: private
  Content-Type: text/html;charset=UTF-8
  Date: Wed, 02 Jan 2013 07:43:40 GMT
  Transfer-Encoding:  chunked
  Connection: keep-alive
  Connection: Transfer-Encoding
  Set-Cookie: sid=UOPlLC7t-zl20-k7; Domain=reference.com; Expires=Wed, 02-Jan-2013 08:13:40 GMT; Path=/
  Set-Cookie: cu.wz=0; Domain=.reference.com; Expires=Thu, 02-Jan-2014 07:43:40 GMT; Path=/
  Set-Cookie: recsrch=apple; Domain=reference.com; Expires=Tue, 02-Apr-2013 07:43:40 GMT; Path=/
  Set-Cookie: dcc=*~*~*~*~*~*~*~*~; Domain=reference.com; Expires=Thu, 02-Jan-2014 07:43:40 GMT; Path=/
  Set-Cookie: iv_dic=1-0; Domain=reference.com; Expires=Thu, 03-Jan-2013 07:43:40 GMT; Path=/
  Set-Cookie: accepting=1; Domain=.reference.com; Expires=Thu, 02-Jan-2014 07:43:40 GMT; Path=/
  Set-Cookie: bid=UOPlLC7t-zlrHXne; Domain=reference.com; Expires=Fri, 02-Jan-2015 07:43:40 GMT; Path=/
Length: unspecified [text/html]

Investigating the saved file using vim also reveals that the data is correctly UTF-8 encoded; the same is true when fetching the URL using Python.

  • Yes, that's true, but the OP didn't really mean that he wanted to convert characters to UTF-8. He wanted to convert them to ASCII. – LarsH Jun 04 '19 at 13:45

My issue was different, but this page helped resolve it: `unicodedata.normalize('NFKC', 'V').encode('ascii', 'ignore')` gives the output `b'V'`.
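The input character in the snippet above appears to have been flattened in transcription; the NFKC behaviour it describes can be reproduced with any compatibility character, for example a fullwidth letter (a sketch, Python 3):

```python
import unicodedata

# U+FF21 FULLWIDTH LATIN CAPITAL LETTER A has a compatibility decomposition to 'A'
fullwidth = '\uFF21'
print(unicodedata.normalize('NFKC', fullwidth).encode('ascii', 'ignore'))  # b'A'

# NFC applies only canonical mappings, so the fullwidth letter survives
# normalization and is then dropped by the ASCII encode
print(unicodedata.normalize('NFC', fullwidth).encode('ascii', 'ignore'))   # b''
```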

Amir