1

This is my code:

#!C:/Python27/python
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup
import urllib2
import sys
import urlparse
import io

url = "http://www.dlib.org/dlib/november14/beel/11beel.html"
#url = "http://eqa.unibo.it/article/view/4554"
#r = requests.get(url)
html = urllib2.urlopen(url)
soup = BeautifulSoup(html, "html.parser")
#soup = BeautifulSoup(r.text,'lxml')

if url.find("http://www.dlib.org") != -1:
    div = soup.find('td', valign='top')
else:
    div = soup.find('div',id='content')

f = open('path/file_name.html', 'w')
f.write(str(div))
f.close()

While scraping these web pages, I found some non-ASCII characters in the HTML file written by this script. I need to either remove them or convert them into readable characters. Any advice? Thanks.

jcr
Poggio
  • The script you wrote does not throw any errors. What is the problem with non-ASCII letters? Do you not want them in the file you are writing? – jcr Oct 21 '15 at 16:04
  • I know there are no errors, but there are some characters such as "Â" in the HTML that I need to remove. – Poggio Oct 21 '15 at 16:06
  • @Poggio maybe this will help: http://stackoverflow.com/questions/17732695/how-to-return-plain-text-from-beautiful-soup-instead-of-unicode – LetzerWille Oct 21 '15 at 16:24
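
A stray "Â" in scraped output is often a sign that UTF-8 bytes are being displayed as Latin-1, rather than a character that genuinely needs removing. The sketch below reproduces that effect with a non-breaking space (U+00A0). It is written for Python 3, whereas the script in the question targets Python 2, and whether this is what happens on the pages in question is an assumption.

```python
# -*- coding: utf-8 -*-
# Sketch (Python 3): how a non-breaking space can turn into a visible "Â".
# Assumption: the page contains U+00A0 and the saved UTF-8 file is read as Latin-1.

nbsp = u"\u00a0"                         # NO-BREAK SPACE, written as &nbsp; in HTML
utf8_bytes = nbsp.encode("utf-8")        # two bytes: 0xC2 0xA0
misread = utf8_bytes.decode("latin-1")   # each byte becomes its own character

print(repr(utf8_bytes))
print(repr(misread))
```

If this is the cause, the data itself is intact and the fix is to read or display the file with the encoding it was written in.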

3 Answers

4

A byte holds 8 bits (values 0-255), while ASCII characters use only 7 bits (values 0-127), so you can simply drop every character whose ord value is 128 or higher.

chr converts an integer to a character; ord converts a character to an integer.

text = ''.join(c for c in str(div) if ord(c) < 128)

This should be your final code:

#!C:/Python27/python
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup
import urllib2
import sys
import urlparse
import io

url = "http://www.dlib.org/dlib/november14/beel/11beel.html"
#url = "http://eqa.unibo.it/article/view/4554"
#r = requests.get(url)
html = urllib2.urlopen(url)
soup = BeautifulSoup(html, "html.parser")
#soup = BeautifulSoup(r.text,'lxml')

if url.find("http://www.dlib.org") != -1:
    div = soup.find('td', valign='top')
else:
    div = soup.find('div',id='content')

f = open('path/file_name.html', 'w')
text = ''.join(c for c in str(div) if ord(c) < 128)
f.write(text)
f.close()
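
The filtering step in the script above can be tried in isolation on a sample string. This is a minimal sketch for Python 3; `keep_ascii` and the sample text are hypothetical and not part of the original script.

```python
# -*- coding: utf-8 -*-
# Sketch (Python 3): keep only characters whose code point is below 128.
# `keep_ascii` is a hypothetical helper name used for illustration.

def keep_ascii(text):
    """Return `text` with every non-ASCII character removed."""
    return "".join(c for c in text if ord(c) < 128)

sample = u"Caf\u00e9 \u00e0 la carte"   # "Café à la carte"
print(keep_ascii(sample))
```

Note that accented letters disappear entirely instead of being replaced by their unaccented form, so this approach suits text where losing those characters is acceptable.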
jcr
  • Traceback (most recent call last): File "pppp.py", line 38, in div = ''.join((c for c in div if ord(c) < 128)) File "pppp.py", line 38, in div = ''.join((c for c in div if ord(c) < 128)) TypeError: ord() expected string of length 1, but Tag found This is the error – Poggio Oct 21 '15 at 17:25
  • There should be a str(div) to convert the div tag to a text string; I forgot that. – jcr Oct 21 '15 at 18:49
  • There are some characters I need to handle in a better way, such as accented letters (for example à, è, ì, ò, ù), which I need to print with the rest of the text. Do you know if there is a solution? – Poggio Oct 22 '15 at 13:42
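
If the accented letters should be kept rather than removed, one option is to write the file with an explicit encoding. This is a minimal sketch, assuming the markup is available as a Unicode string (for a BeautifulSoup tag, `div.decode()` returns one). The sample markup and the temporary file path are placeholders chosen only for illustration.

```python
# -*- coding: utf-8 -*-
# Sketch: write Unicode text to a file as UTF-8 so accented letters survive.
# io.open accepts an `encoding` argument on both Python 2 and Python 3.
import io
import os
import tempfile

# Stand-in for the scraped markup; in the script this would be the tag's Unicode text.
markup = u"<td>\u00e0 \u00e8 \u00ec \u00f2 \u00f9</td>"

# Hypothetical output location, used only for this example.
path = os.path.join(tempfile.gettempdir(), "scraped_example.html")

with io.open(path, "w", encoding="utf-8") as f:
    f.write(markup)

# Reading back with the same encoding returns the original characters.
with io.open(path, "r", encoding="utf-8") as f:
    round_trip = f.read()

print(round_trip == markup)
```

A browser may still display "Â"-style characters if it guesses a different encoding, since a saved fragment like this carries no charset declaration of its own.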
4

Try to normalize the string and then ASCII encode it ignoring errors.

# -*- coding: utf-8 -*-
from unicodedata import normalize

string = 'úäô§'

if isinstance(string, str):
    string = string.decode('utf-8')

print normalize('NFKD', string).encode('ASCII', 'ignore')
# Output: uao
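
The same normalize-then-encode idea can be wrapped in a small function and tried on the accented vowels mentioned in the comments. This is a sketch for Python 3 (the snippet above is Python 2), and `to_ascii` is a hypothetical helper name.

```python
# -*- coding: utf-8 -*-
# Sketch (Python 3): strip accents by decomposing, then dropping non-ASCII code points.
# `to_ascii` is a hypothetical helper name used for illustration.
import unicodedata

def to_ascii(text):
    """Decompose accented characters (NFKD) and drop whatever is not ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(to_ascii(u"\u00e0 \u00e8 \u00ec \u00f2 \u00f9"))   # accented vowels: à è ì ò ù
print(to_ascii(u"\u00fa\u00e4\u00f4\u00a7"))             # the sample string: úäô§
```

Characters with no ASCII base letter, such as "§", are still dropped, so this keeps the text readable but is not lossless.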
Dušan Maďar
  • I think your solution is the best, because my solution does weird things to 16-bit encoded letters, whereas yours behaves slightly more sanely. – jcr Oct 21 '15 at 16:52
-2

To remove non-ASCII characters from text:

import string

# Keep only printable ASCII characters (letters, digits, punctuation, whitespace).
text = ''.join(ch for ch in text if ch in string.printable)
Dušan Maďar
LetzerWille