I looked around at similar questions but unfortunately found no solution.
I am currently trying to classify websites based on content, and to do that I am getting their HTML source and performing some kind of document/keyword classification on it.
Right now I'm removing a lot of stopwords, but I also want to exclude things like function declarations in the HTML source. For example:
function(){
... // function definition
}
I want to get rid of everything between the braces so that only an empty declaration remains. I'm doing this in Python with the re module, and tried the following:
htmlSource = re.sub('/\{([^}]+)\}/', '', htmlSource)
Unfortunately, this only seems to remove text that sits directly between a pair of curly braces, rather than everything enclosed by the function's outer braces.
I'm guessing that the regex also needs to account for an arbitrary amount of whitespace and any number of newlines, but I'm pretty inexperienced when it comes to regex.
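To make the goal concrete, here is a minimal sketch of the behaviour I'm after, written for Python 3. Two things are worth noting: in Python the surrounding slashes in my pattern are matched as literal "/" characters rather than acting as delimiters, and a character class like [^{}] already matches spaces and newlines. The function name empty_brace_blocks and the "\x00" placeholder are made up for illustration, and the sketch assumes braces are balanced and do not appear inside string literals or comments.

```python
import re

# A brace pair with no other braces inside it. [^{}] also matches newlines,
# so multi-line function bodies need no special handling.
INNERMOST_BLOCK = re.compile(r"\{[^{}]*\}")


def empty_brace_blocks(source):
    """Collapse every outermost {...} block in source to an empty {}."""
    # Hypothetical placeholder; assumed never to occur in the input text.
    placeholder = "\x00"
    previous = None
    # Collapse innermost blocks repeatedly so nested braces are handled too.
    while previous != source:
        previous = source
        source = INNERMOST_BLOCK.sub(placeholder, source)
    # Each surviving placeholder stands for one outermost block.
    return source.replace(placeholder, "{}")
```

Calling empty_brace_blocks(htmlSource) would leave a declaration such as function(){} in place while dropping its body. It would also empty CSS rules and any other brace-delimited text, which may or may not be acceptable for classification.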
Could anyone help?
By the way, I'm currently reading the HTML content using urllib2.urlopen(url).read(). If there is a better way of getting it (without non-visible JS function declarations and such), I would greatly appreciate hearing about that as well.
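For the second part, one direction that might work is to parse the HTML and keep only the text outside script and style elements, instead of cleaning the raw source with regexes. The following is a rough sketch using only the standard library, assuming Python 3 (where urllib2 became urllib.request) and assuming the page is UTF-8 encoded. VisibleTextExtractor, visible_text, and fetch_html are names invented for this example.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class VisibleTextExtractor(HTMLParser):
    """Collects text nodes, skipping anything inside <script> or <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def visible_text(html):
    """Return the text content of html without script or style bodies."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def fetch_html(url):
    """Download a page; the UTF-8 decoding here is an assumption."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")
```

With this, visible_text(fetch_html(url)) would give the text to run stopword removal and classification on. It does not catch inline event handlers or content that a page builds with JavaScript at load time.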