I'm trying to detect the language of a number of pages in Python, using BeautifulSoup to extract the text and cld2 to do the detection. This is how I generate the text object:
from bs4 import BeautifulSoup

soup = BeautifulSoup(content, "html.parser")
text = soup.find_all('body')[-1].text
This produces a unicode object, but I often get the following error when I run cld2 on it.
>>> cld2.detect(text)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 116: ordinal not in range(128)
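If I read the traceback right, something inside cld2 is implicitly ASCII-encoding my unicode object, because I can reproduce the same error with a bare encode (u'\xa0' is a non-breaking space, which scraped pages are full of):

>>> u'\xa0'.encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 0: ordinal not in range(128)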
If I encode it, I no longer get that error for that particular text.
>>> cld2.detect(text.encode('utf-8'))
(True, 4810, (('DUTCH', 'nl', 68, 845.0), ('ENGLISH', 'en', 31, 922.0), ('Unknown', 'un', 0, 0.0)))
...but then I end up with different errors for another piece of text, both with and without the encode:
>>> cld2.detect(second_text)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 1158-1161: ordinal not in range(128)
>>> cld2.detect(second_text.encode('utf-8'))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
cld2.error: input contains invalid UTF-8 around byte 2079 (of 2605)
All of this encoding stuff completely baffles me. Is there some way to ensure that the text I pass to cld2 contains only valid UTF-8, whether it starts out as a unicode object or as raw bytes?
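For what it's worth, this is the kind of helper I've been sketching. The surrogate theory and the helper name are my own guesswork, not anything from the cld2 docs, and I'm assuming Python 2, since that's where the u'...' tracebacks come from:

import re

# Python 2's UTF-8 codec happily encodes lone surrogates
# (U+D800-U+DFFF); my guess is these are what cld2 rejects as
# "invalid UTF-8", so strip them before encoding.
_surrogates = re.compile(u'[\ud800-\udfff]')

def clean_for_cld2(text):
    # Accept raw bytes or a unicode object; decode bytes defensively
    # so anything that isn't valid UTF-8 becomes U+FFFD instead of
    # blowing up later.
    if isinstance(text, bytes):
        text = text.decode('utf-8', 'replace')
    text = _surrogates.sub(u'', text)
    # Re-encode: the result should now be valid UTF-8 end to end.
    return text.encode('utf-8')

With that I can call cld2.detect(clean_for_cld2(second_text)) instead of picking the right .encode() by hand each time, but I don't know whether surrogates are really the only thing cld2 objects to, or whether there's a more standard way to guarantee clean UTF-8.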