How to remove non-ASCII characters in Python

Question

This is my code:

#!C:/Python27/python
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup
import urllib2
import sys
import urlparse
import io

url = "http://www.dlib.org/dlib/november14/beel/11beel.html"
#url = "http://eqa.unibo.it/article/view/4554"
#r = requests.get(url)
html = urllib2.urlopen(url)
soup = BeautifulSoup(html, "html.parser")
#soup = BeautifulSoup(r.text,'lxml')

if url.find("http://www.dlib.org") != -1:
    div = soup.find('td', valign='top')
else:
    div = soup.find('div',id='content')

f = open('path/file_name.html', 'w')
f.write(str(div))
f.close()

While scraping those webpages, I found some non-ASCII characters in the HTML file written by this script. I need to either remove them or convert them into readable characters. Any advice? Thanks

The script you wrote does not throw any errors. What is the problem with the non-ASCII letters? Do you not want them in the file you are writing? – jcr, Oct 21 '15 at 16:04
I know there are no errors, but there are some characters like "Â" in the HTML that I need to remove. – Poggio, Oct 21 '15 at 16:06
@Poggio Maybe this will help: http://stackoverflow.com/questions/17732695/how-to-return-plain-text-from-beautiful-soup-instead-of-unicode – LetzerWille, Oct 21 '15 at 16:24

jcr · Accepted Answer · 2015-10-21T18:47:44.243

4

Byte characters are 8-bit (0-255) and ASCII characters are 7-bit (0-127), so you can simply drop every character with an ord value of 128 or above, keeping only those below 128.

chr converts an integer to a character; ord converts a character to an integer.

text = ''.join(c for c in str(div) if ord(c) < 128)

This should be your final code:

#!C:/Python27/python
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup
import urllib2
import sys
import urlparse
import io

url = "http://www.dlib.org/dlib/november14/beel/11beel.html"
#url = "http://eqa.unibo.it/article/view/4554"
#r = requests.get(url)
html = urllib2.urlopen(url)
soup = BeautifulSoup(html, "html.parser")
#soup = BeautifulSoup(r.text,'lxml')

if url.find("http://www.dlib.org") != -1:
    div = soup.find('td', valign='top')
else:
    div = soup.find('div',id='content')

f = open('path/file_name.html', 'w')
text = ''.join(c for c in str(div) if ord(c) < 128)
f.write(text)
f.close()
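
As a minimal sketch of what the ord filter does, here it is applied to a small sample string instead of the scraped page. The helper name strip_non_ascii and the sample text are made up for illustration, and the string is a Unicode literal so the snippet behaves the same way on Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-

def strip_non_ascii(text):
    """Keep only characters whose code point is in the ASCII range (0-127)."""
    return ''.join(c for c in text if ord(c) < 128)

# Hypothetical sample: "café", a space, then "Â" and a no-break space, then "ok".
sample = u"caf\u00e9 \u00c2\u00a0ok"

print(strip_non_ascii(sample))  # caf ok
```

Note that the accented letter is dropped entirely rather than replaced by its unaccented form, so "café" becomes "caf".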

edited Oct 21 '15 at 18:47

answered Oct 21 '15 at 15:51

jcr

Traceback (most recent call last): File "pppp.py", line 38, in <module> div = ''.join((c for c in div if ord(c) < 128)) File "pppp.py", line 38, in <genexpr> div = ''.join((c for c in div if ord(c) < 128)) TypeError: ord() expected string of length 1, but Tag found. This is the error. – Poggio Oct 21 '15 at 17:25
There should be a str(div) to convert the div tag to a text string; I forgot that. – jcr Oct 21 '15 at 18:49
There are some characters I need to handle in a better way, such as the accented letters. For example: à - è - ì - ò - ù, which I need to print with the rest of the text. Do you know if there is a solution? – Poggio Oct 22 '15 at 13:42
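
If the goal is to keep letters such as à and è rather than drop them, one option is to leave the text as Unicode and write the file with an explicit encoding. A stray "Â" is often a sign that UTF-8 bytes are being displayed as Latin-1, but that is an assumption about this page, not something the thread confirms. The sketch below uses io.open, which exists on both Python 2 and Python 3; the helper names and the temporary file are hypothetical.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

def write_utf8(path, text):
    """Write Unicode text to path, encoded as UTF-8."""
    with io.open(path, 'w', encoding='utf-8') as f:
        f.write(text)

def read_utf8(path):
    """Read a UTF-8 encoded file back as Unicode text."""
    with io.open(path, 'r', encoding='utf-8') as f:
        return f.read()

# Hypothetical sample containing the accented letters from the comment.
sample = u"\u00e0 \u00e8 \u00ec \u00f2 \u00f9"
path = os.path.join(tempfile.mkdtemp(), 'sample.html')

write_utf8(path, sample)
print(read_utf8(path) == sample)  # True
```

In the original script this would mean passing Unicode text to f.write instead of a byte string, and making sure whatever opens the HTML file also treats it as UTF-8.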

Dušan Maďar · Answer 2 · 2015-10-21T16:28:28.183

4

Try to normalize the string and then ASCII encode it ignoring errors.

# -*- coding: utf-8 -*-
from unicodedata import normalize

string = 'úäô§'

if isinstance(string, str):
    string = string.decode('utf-8')

print normalize('NFKD', string).encode('ASCII', 'ignore')
# prints: uao
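
The same idea can be wrapped in a small function. This is only a sketch: the name to_ascii is made up, and it takes Unicode input so that it runs on both Python 2 and Python 3. NFKD splits an accented letter into a base letter plus a combining mark, and encoding to ASCII with 'ignore' then drops the mark and anything else that has no ASCII form.

```python
# -*- coding: utf-8 -*-
from unicodedata import normalize

def to_ascii(text):
    """Replace accented letters by their base letters and drop the rest."""
    decomposed = normalize('NFKD', text)       # e.g. "à" -> "a" + combining grave
    ascii_bytes = decomposed.encode('ascii', 'ignore')
    return ascii_bytes.decode('ascii')         # back to a text string

print(to_ascii(u"\u00e0 \u00e8 \u00ec \u00f2 \u00f9"))  # a e i o u
print(to_ascii(u"\u00fa\u00e4\u00f4\u00a7"))            # uao
```

Characters without a decomposition, such as "§", are removed rather than transliterated.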

edited Oct 21 '15 at 16:28

answered Oct 21 '15 at 16:21

Dušan Maďar

I think your solution is the best, because my solution does weird things to 16-bit encoded letters, whereas yours behaves slightly more sanely. – jcr Oct 21 '15 at 16:52

score -2 · Answer 3 · edited Oct 21 '15 at 16:47

To remove non-ASCII characters from text, keep only the characters found in string.printable (the printable ASCII characters):

import string

text = ''.join([ch for ch in text if ch in string.printable])

edited Oct 21 '15 at 16:47

Dušan Maďar

answered Oct 21 '15 at 15:51

LetzerWille

This throws errors, because I can't write non-ASCII characters in the Python script. – Poggio Oct 21 '15 at 16:07
@Poggio You can't run this list comprehension? What are the errors that you are getting? – LetzerWille Oct 21 '15 at 16:13
