
I'm trying to detect the language of a number of web pages using BeautifulSoup and cld2 in Python.

This is how I use Beautiful Soup to generate the text object:

from bs4 import BeautifulSoup

soup = BeautifulSoup(content, "html.parser")
text = soup.findAll('body')[-1].text

This produces a `unicode` object, but I often get the following error when I run cld2 on it.

>>> cld2.detect(text)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 116: ordinal not in range(128)

If I encode it, I no longer get that error for that particular text.

>>> cld2.detect(text.encode('utf-8'))
(True, 4810, (('DUTCH', 'nl', 68, 845.0), ('ENGLISH', 'en', 31, 922.0), ('Unknown', 'un', 0, 0.0)))

...but I then get a different error for another piece of text:

>>> cld2.detect(second_text)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 1158-1161: ordinal not in range(128)
>>> cld2.detect(second_text.encode('utf-8'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
cld2.error: input contains invalid UTF-8 around byte 2079 (of 2605)

All of this encoding stuff completely baffles me. Is there some way to ensure that I have only valid UTF-8 and Unicode characters?
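The closest I've got is forcing the text through the UTF-8 codec with a non-strict error handler, though I'm not sure it's the right approach (the helper name is mine):

```python
# Sketch of a possible cleanup step: force the text through UTF-8,
# replacing anything the codec rejects (e.g. lone surrogates) so
# cld2 never sees an unencodable character or invalid bytes.
def to_valid_utf8(text):
    # errors='replace' substitutes '?' for unencodable characters
    # instead of raising UnicodeEncodeError
    return text.encode('utf-8', errors='replace')
```

Legitimate non-ASCII characters like the no-break space (`u'\xa0'`) survive this round-trip untouched; only characters UTF-8 genuinely cannot encode are replaced.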

neelshiv
  • Do you know the encoding of `content`? See the `bs4` [documentation on this issue](http://www.crummy.com/software/BeautifulSoup/bs4/doc/#encodings). – scharfmn Aug 26 '15 at 17:04
  • Everything that I am grabbing is scraped from the web using Beautiful Soup. For that last item that had the "can't encode characters" error, the html specifies "charset='utf-8'" – neelshiv Aug 26 '15 at 17:06
  • For just that document, try passing `BeautifulSoup(content, from_encoding="utf-8")` – scharfmn Aug 26 '15 at 17:08
  • Unfortunately, that did not work. I am also scraping a very large number of documents, and I wouldn't be able to manually set the rules for each one. I would 100% be willing to strip out items that are perceived to be junk. I don't think these obscure characters will have a large impact on language detection, but I could be wrong. – neelshiv Aug 26 '15 at 17:12
  • To quote a famous podiatrist: there's no one-size-fits-all in the wild. Take a look at [this](http://stackoverflow.com/a/32111908/1599229) and go from there. – scharfmn Aug 26 '15 at 17:27
  • *Encoding* into UTF-8 presumes that what you have is already Unicode. That part is certainly wrong. – tripleee Dec 17 '15 at 06:00
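Following tripleee's point that encoding presumes the input is already Unicode: if `content` might sometimes still be raw bytes, a round-trip that decodes first might be what's needed. A sketch (the helper name is mine; this assumes UTF-8 as the working encoding):

```python
def clean_text(data):
    # Encoding into UTF-8 presumes we already hold Unicode text;
    # if we were handed raw bytes, decode them first, replacing
    # invalid UTF-8 sequences with U+FFFD.
    if isinstance(data, bytes):
        data = data.decode('utf-8', errors='replace')
    # Now re-encode; anything UTF-8 cannot represent becomes '?'.
    return data.encode('utf-8', errors='replace')
```

The result is always a valid UTF-8 byte string, whichever type went in.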

0 Answers