
What I'm trying to do: I fetch a list of URIs from a database and download them, remove the stopwords, count how often each word appears on the webpage, and then try to save the result in MongoDB.

The problem: When I try to save the result in the database I get the error bson.errors.InvalidDocument: the document must be valid UTF-8.

It appears to be related to byte sequences like '\xc3someotherstrangewords' and '\xe2something'. When I process the webpages I try to remove the punctuation, but I can't remove the accents because that would produce wrong words.

What I already tried:

  • identifying the character encoding from the webpage's headers;
  • using chardet;
  • using re.compile(r"[^a-zA-Z]") and/or unicode(variable, 'ascii', 'ignore') — but those aren't good for non-English languages because they strip the accents.
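One stdlib-only alternative to the attempts above (a sketch, not the asker's code): decode the raw bytes as UTF-8 first and, if that fails, fall back to Latin-1, which accepts any byte sequence. This keeps accented characters instead of stripping them.

```python
def decode_best_effort(raw):
    """Try UTF-8 first; fall back to Latin-1, which never raises."""
    try:
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        return raw.decode('latin-1')

print(decode_best_effort(b'caf\xc3\xa9'))  # UTF-8 bytes  -> café
print(decode_best_effort(b'caf\xe9'))      # Latin-1 bytes -> café
```

The fallback can still mis-decode pages in other encodings (e.g. cp1251), so a real detector like chardet is more robust; this is only a last resort.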

What I want to know is:
does anyone know how to identify these characters and translate them to the right word/encoding?
e.g. take '\xe2' from a webpage and translate it to 'â'
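For that concrete example: if the page's bytes are actually Latin-1 (or cp1252), decoding them with that codec recovers the accented character directly — assuming, of course, that Latin-1 really is the page's encoding, which is exactly what has to be determined first.

```python
raw = b'ch\xe2teau'            # bytes as they might arrive from a Latin-1 page
text = raw.decode('latin-1')   # -> 'château'
print(text)
print(b'\xe2'.decode('latin-1'))  # -> 'â'
```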

(English isn't my first language, so forgive me.) EDIT: if anyone wants to see the source code

raphaeljlps
  • You really want to read the [Python Unicode HOWTO](http://docs.python.org/2/howto/unicode.html) and [Joel on Software on Unicode](http://www.joelonsoftware.com/articles/Unicode.html). Without knowing what encoding was used by the website, this is hard to answer. – Martijn Pieters Feb 25 '13 at 18:43
  • I had seen the Python Unicode HOWTO before, but thanks, I will look at the Joel article. – raphaeljlps Feb 25 '13 at 21:10

1 Answer


It is not easy to find out the correct character encoding of a website because the information in the header might be wrong. BeautifulSoup does a pretty good job at guessing the character encoding and automatically decodes it to Unicode.

from bs4 import BeautifulSoup
import urllib  # Python 2; on Python 3 use urllib.request instead

url = 'http://www.google.de'
fh = urllib.urlopen(url)
html = fh.read()
soup = BeautifulSoup(html, 'html.parser')

# text is a Unicode string
text = soup.body.get_text()
# encoded_text is a UTF-8 byte string that you can store in mongo
encoded_text = text.encode('utf-8')

See also the answers to this question.
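As a quick sanity check before inserting into MongoDB (not part of the answer above, just a helper sketch), you can verify that a byte string is valid UTF-8 by attempting to decode it:

```python
def is_valid_utf8(data):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False

print(is_valid_utf8('château'.encode('utf-8')))  # True
print(is_valid_utf8(b'ch\xe2teau'))              # False: raw Latin-1 bytes
```

Anything that fails this check is what triggers the bson InvalidDocument error and needs to be decoded to Unicode first.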

Suzana
  • Well, I think this solution works, but I found the problem: wordpunct_tokenize was separating the chars, e.g. '\xe2\xc2' into '\xe2', '\xc2'. – raphaeljlps Feb 27 '13 at 10:45
  • Well, BeautifulSoup was really awesome, and I think I found the problem: it appears when I remove the accents. – raphaeljlps Mar 06 '13 at 21:08
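On the tokenizing issue mentioned in the comments: splitting words only works on properly decoded Unicode strings, not raw bytes. A minimal regex sketch (not the asker's nltk wordpunct_tokenize code) that keeps accented words intact:

```python
import re

text = u'O château é bonito'
# re.UNICODE makes \w match accented letters too (default on Python 3 str)
words = re.findall(r'\w+', text, re.UNICODE)
print(words)  # ['O', 'château', 'é', 'bonito']
```

Running this on undecoded bytes would split multi-byte characters apart, which is exactly the '\xe2', '\xc2' symptom described above.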