I am trying to parse an HTML page and save it in a database. I build JSON from the tags of the page.
Some of the tags contain JavaScript, like this:
<script type="text/javascript">RegisterSod("search.js", "");</script><script type="text/javascript" language="JavaScript" defer="defer">
<!--
function SearchEnsureSOD() { EnsureScript('search.js',typeof(GoSearch)); } _spBodyOnLoadFunctionNames.push('SearchEnsureSOD');function SB420AF5B_Submit()
.
.
.
{ document.getElementById('ctl00_region_header_region_headerLinks_helpAreaID_ctl00_ctl00_SB420AF5B_InputKeywords').value=''; }}
// -->
</script>
This is a normal tag item, and there is no problem with it:
{'tag': 'div', 'unqid': '.....', 'id': 'newsContent0'}
But for the JavaScript tag I get this item, which causes the error:
{'text': 'IK F uu ph---------------------', 'tag': <cyfunction Comment at 0x00000000027A79A0>, 'unqid': '.....'}
This is my code:
import json
import requests
from lxml import html

ac = requests.get(url)
html_text = ac.text
lx = html.fromstring(html_text)
# ...some parsing code
json.dumps(items).decode('utf-8')  # <-- this is where I get the error
The error is below:
Traceback (most recent call last):
File "main3.py", line 132, in <module>
PageRunner(url)
File "main3.py", line 122, in PageRunner
InsertPageTags(1, url)
File "main3.py", line 58, in InsertPageTags
parameter = (WebsiteID, Url, json.dumps(items).decode('utf-8'))
File "C:\Python27\lib\json\__init__.py", line 244, in dumps
return _default_encoder.encode(obj)
File "C:\Python27\lib\json\encoder.py", line 207, in encode
chunks = self.iterencode(o, _one_shot=True)
File "C:\Python27\lib\json\encoder.py", line 270, in iterencode
return _iterencode(o, 0)
File "C:\Python27\lib\json\encoder.py", line 184, in default
raise TypeError(repr(o) + " is not JSON serializable")
TypeError: <cyfunction Comment at 0x00000000029279A0> is not JSON serializable
How can I dump the HTML with its comments, or remove the comments from the HTML?
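
A minimal sketch of one way to dump the items with the comment entry intact, assuming the `tag` value is the `lxml.etree.Comment` factory function (which is what the `<cyfunction Comment ...>` in the traceback suggests). `json.dumps` accepts a `default` hook that is called for any value it cannot serialize. The `Comment` function below is a hypothetical stand-in so the example runs without lxml, and `tag_to_json` is an illustrative name, not part of any library.

```python
import json


def Comment(text=None):
    """Hypothetical stand-in for lxml.etree.Comment, so the sketch runs without lxml."""
    return text


def tag_to_json(obj):
    # json.dumps calls this hook for every value it cannot serialize on its own.
    if callable(obj):
        # Replace the function object with its name, e.g. "Comment".
        return getattr(obj, "__name__", "callable")
    raise TypeError(repr(obj) + " is not JSON serializable")


# Assumed shape of the parsed items, modelled on the two examples in the question.
items = [
    {"tag": "div", "id": "newsContent0"},
    {"tag": Comment, "text": "function SearchEnsureSOD() {}"},
]

dumped = json.dumps(items, default=tag_to_json, sort_keys=True)
print(dumped)
```

To remove the comments instead, the parsing loop could skip any element whose `tag` is not a string, since lxml comment nodes carry a function there rather than a tag name.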