I'm writing a wrapper in Python 3. While testing it, I ran into a problem with an HTML page encoded in UTF-8:
Traceback (most recent call last):
  File "/home/caiocesare/PycharmProjects/script/xpathExtraction.py", line 121, in <module>
    xpaths_extraction()
  File "/home/caiocesare/PycharmProjects/script/xpathExtraction.py", line 117, in xpaths_extraction
    xpaths = get_xpaths(html_file)
  File "/home/caiocesare/PycharmProjects/script/xpathExtraction.py", line 21, in get_xpaths
    content = clean.clean_html(htmlFile)
  File "/home/caiocesare/PycharmProjects/script/htmlCleaner.py", line 8, in clean_html
    return clean_parsed_html(parsed_html)
  File "/home/caiocesare/PycharmProjects/script/htmlCleaner.py", line 24, in clean_parsed_html
    refactored_url = cleaner.clean_html(parsed_html)
  File "src/lxml/html/clean.py", line 520, in lxml.html.clean.Cleaner.clean_html
  File "src/lxml/html/clean.py", line 396, in lxml.html.clean.Cleaner.__call__
  File "/home/caiocesare/PycharmProjects/script/venv/lib/python3.6/site-packages/lxml/html/__init__.py", line 364, in drop_tag
    if self.text and isinstance(self.tag, basestring):
  File "src/lxml/etree.pyx", line 1014, in lxml.etree._Element.text.__get__
  File "src/lxml/apihelpers.pxi", line 670, in lxml.etree._collectText
  File "src/lxml/apihelpers.pxi", line 1405, in lxml.etree.funicode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb1 in position 98: invalid start byte
The odd thing is that the page is UTF-8. All of my pages are UTF-8: the first 80,000 were processed without any problem, but page 80,001 triggers this decoding error. Why do I get this error, and is there a way to fix it? This is the cleaning code:
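import lxml.html
from lxml.html.clean import Cleaner


def clean_html(url):
    # Parse the page (file path or URL) into an lxml element tree.
    parsed_html = lxml.html.parse(url)
    return clean_parsed_html(parsed_html)


def clean_parsed_html(parsed_html):
    # An empty document has no root element, so there is nothing to clean.
    if parsed_html.getroot() is None:
        return ""
    cleaner = Cleaner()
    cleaner.javascript = True
    cleaner.style = True
    # Note: 'href' is an attribute, not a tag, so that entry matches nothing.
    cleaner.kill_tags = ['head', 'script', 'header', 'href', 'footer', 'a']
    refactored_url = cleaner.clean_html(parsed_html)
    return lxml.html.tostring(refactored_url)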
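
One way to check whether the failing page really is valid UTF-8 is to decode its raw bytes directly and see where decoding stops. The sketch below uses only the standard library; the helper names `find_invalid_utf8` and `decode_lenient` and the sample bytes are hypothetical and are not part of my script. It assumes the page content is already available as `bytes`.

```python
from typing import Optional, Tuple


def find_invalid_utf8(data: bytes) -> Optional[Tuple[int, int]]:
    """Return (offset, byte value) of the first invalid UTF-8 byte, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.start is the offset of the first byte that could not be decoded.
        return exc.start, data[exc.start]
    return None


def decode_lenient(data: bytes) -> str:
    """Decode as UTF-8, replacing undecodable bytes with U+FFFD."""
    return data.decode("utf-8", errors="replace")


# Hypothetical sample: valid UTF-8 text followed by a stray 0xb1 byte.
sample = "caf\u00e9 ".encode("utf-8") + b"\xb1 5"

print(find_invalid_utf8(sample))   # offset and value of the bad byte, if any
print(decode_lenient(sample))      # text with the bad byte replaced
```

If this reports an invalid byte for page 80,001, the page is probably not pure UTF-8 even though it claims to be. In that case, reading the file with `open(path, "rb")` and passing the leniently decoded string to `lxml.html.fromstring` might avoid the exception, at the cost of replacing the offending bytes; I have not verified this against the actual page.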