
I am trying to parse an HTML page and save it in a database, building JSON from the tags of the page.

Some of the tags contain JavaScript, like this:

<script type="text/javascript">RegisterSod("search.js", "");</script><script type="text/javascript" language="JavaScript" defer="defer">
<!--
function SearchEnsureSOD() { EnsureScript('search.js',typeof(GoSearch)); } _spBodyOnLoadFunctionNames.push('SearchEnsureSOD');function SB420AF5B_Submit() 
.
.
. 
{ document.getElementById('ctl00_region_header_region_headerLinks_helpAreaID_ctl00_ctl00_SB420AF5B_InputKeywords').value=''; }}
// -->
</script>

This is a normal tag item, and there is no problem with it:

{'tag': 'div', 'unqid': '.....', 'id': 'newsContent0'}

But with the JavaScript tag I get an error, because the item looks like this:

{'text': 'IK F uu ph---------------------', 'tag': <cyfunction Comment at 0x00000000027A79A0>, 'unqid': '.....'}

This is my code:

ac = requests.get(url)
html_text = ac.text
lx = html.fromstring(html_text)
...some parsing codes

json.dumps(items).decode('utf-8')  # <-- this is where I get the error

The error is:

Traceback (most recent call last):
  File "main3.py", line 132, in <module>
    PageRunner(url)
  File "main3.py", line 122, in PageRunner
    InsertPageTags(1, url)
  File "main3.py", line 58, in InsertPageTags
    parameter = (WebsiteID, Url, json.dumps(items).decode('utf-8'))
  File "C:\Python27\lib\json\__init__.py", line 244, in dumps
    return _default_encoder.encode(obj)
  File "C:\Python27\lib\json\encoder.py", line 207, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "C:\Python27\lib\json\encoder.py", line 270, in iterencode
    return _iterencode(o, 0)
  File "C:\Python27\lib\json\encoder.py", line 184, in default
    raise TypeError(repr(o) + " is not JSON serializable")
TypeError: <cyfunction Comment at 0x00000000029279A0> is not JSON serializable

How can I dump the HTML together with its comments, or else remove the comments from the HTML?

serkanuz

2 Answers

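Basically, json.dumps uses Python's JSON encoder, which doesn't know what to do with <cyfunction Comment ...>, so it raises an error. lxml represents an HTML comment as a node whose tag is the function etree.Comment rather than a string, and that function ends up in your items. You will need to write a customized JSON encoder: https://docs.python.org/2/library/json.html#json.JSONEncoder.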
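As a rough sketch of the custom-encoder idea, assuming the comment's tag reaches json.dumps as a plain callable: the class name TagEncoder and the stand-in Comment function below are made up for illustration (in the real code the object would be lxml's etree.Comment), so adapt them to your items.

```python
import json


def Comment():
    """Hypothetical stand-in for lxml's etree.Comment, used only for this sketch."""


class TagEncoder(json.JSONEncoder):
    """Hypothetical encoder that writes callables (such as Comment) as their name."""

    def default(self, o):
        # default() is only called for objects json cannot serialize itself.
        if callable(o):
            return getattr(o, "__name__", "callable")
        # Anything else still raises the usual TypeError.
        return json.JSONEncoder.default(self, o)


items = [
    {"tag": "div", "id": "newsContent0"},
    {"tag": Comment, "text": "..."},
]

encoded = json.dumps(items, cls=TagEncoder)
print(encoded)
```

With this, the comment item is stored with the string "Comment" as its tag instead of failing. Whether you want to keep such items at all is a separate decision.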

Or, if you know that all the offending tags are comments of the form <!-- some_text -->, you can first do a regex replace with an empty string before parsing, so they never reach your items. Taking the regex from this answer (Remove HTML comments with Regex, in Javascript), it would be:

var COMMENT_PSEUDO_COMMENT_OR_LT_BANG = new RegExp(
'<!--[\\s\\S]*?(?:-->)?'
+ '<!---+>?'  // A comment with no body
+ '|<!(?![dD][oO][cC][tT][yY][pP][eE]|\\[CDATA\\[)[^>]*>?'
+ '|<[?][^>]*>?',  // A pseudo-comment
'g');
David542

Instead of JavaScript as in the previous answer, you can do the same thing with a regex-based function in Python:

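import re

# A complete HTML comment; [\s\S] lets it span several lines.
HTML_COMMENT = re.compile(r"<!--[\s\S]*?-->")
# Pseudo-comments: <!...> other than DOCTYPE/CDATA, and <?...> processing instructions.
PSEUDO_COMMENT = re.compile(
    r"<!(?![dD][oO][cC][tT][yY][pP][eE]|\[CDATA\[)[^>]*>?|<[?][^>]*>?")

def js_comment_clean(js):
    # Remove whole comments first so their contents are not left behind.
    js = HTML_COMMENT.sub("", js)
    js = PSEUDO_COMMENT.sub("", js)
    return js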

So, replace your original line:

html_text = ac.text

with

html_text = js_comment_clean(ac.text)
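To see what this kind of cleaning does to markup like yours, here is a small self-contained check. The sample string is invented for illustration, and the pattern is a simplified assumption that only covers complete <!-- ... --> comments.

```python
import re

# Assumed pattern: a complete HTML comment, possibly spanning several lines.
HTML_COMMENT = re.compile(r"<!--[\s\S]*?-->")

# Invented sample, loosely modelled on the markup in the question.
sample = (
    '<!DOCTYPE html>\n'
    '<div id="newsContent0"><!-- teaser --></div>\n'
    '<script type="text/javascript">\n'
    '<!--\n'
    'function SearchEnsureSOD() {}\n'
    '// -->\n'
    '</script>'
)

cleaned = HTML_COMMENT.sub("", sample)
print(cleaned)
```

Note that a script body wrapped in <!-- ... --> is removed along with the comment markers, so the script tag is left empty. That is fine if you only store the tags, but not if you need the JavaScript itself.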