
I looked around at similar questions but unfortunately found no solution.

I am currently trying to classify websites based on content, and to do that I am getting their HTML source and performing some kind of document/keyword classification on it.

Right now, I'm replacing a lot of stopwords, but I want to exclude things like function declarations in the HTML source as well. So for example:

function(){
        ... // function definition
}

I want to get rid of everything between the braces so it's just an empty declaration. I'm doing this in Python with the regex library, and tried the following:

htmlSource = re.sub('/\{([^}]+)\}/', '', htmlSource)

Unfortunately, this only seems to get rid of something that is directly surrounded by curly braces, as opposed to being enclosed by it.

I'm guessing that the regex also needs to account for an arbitrary amount of whitespace and any number of newlines, but I'm pretty inexperienced when it comes to regex.

Could anyone help?

By the way, I'm currently reading the HTML content using urllib2.urlopen().response(). If there is a better way of getting it (without non-visible JS function declarations and such), I would greatly appreciate that as well.

alecxe
filpa
  • If you've "looked around at similar questions", I can't see how you can have failed to notice that they all say that parsing HTML with regex is a bad idea. – Daniel Roseman Dec 03 '14 at 15:47
  • By that statement I meant that I had looked at regex questions which aimed to do something similar, regardless of their reasoning. Secondly, I am not storing the source in a database. Thirdly, why wouldn't I want to use something simple like `re.sub('<[^<]+?>', '', htmlSource)` to get rid of tags without needing any libraries? – filpa Dec 03 '14 at 15:51
  • @DanielRoseman Nevermind my reply, I just read up on what you mean. Looks like I'll have to go the library route. – filpa Dec 03 '14 at 15:55
  • What languages are you parsing JS embedded in html? VB ? –  Dec 03 '14 at 16:03
  • @sln I'm not sure I understand your question. I'm using Python to extract the content from a given URL - the websites are not administered by me, I'm merely trying to perform document classification on their content. – filpa Dec 03 '14 at 16:04
  • Don't they have robot crawlers to try to hack into somebody's website. I know they hacked mine. It's hard to tell what it is you are doing. –  Dec 03 '14 at 16:18

1 Answer


Use an HTML Parser to skip script tags.

For example, using BeautifulSoup you can extract() all script tags:

from bs4 import BeautifulSoup

data = """
<body>
    <p>Some text</p>
    <script>
        function(){
            ... // function definition
        }
    </script>
    <div>More text here</div>
</body>
"""

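# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(data, "html.parser")
for script in soup.find_all('script'):
    # extract() removes the tag and its contents from the tree
    script.extract()

# The parenthesised form works on both Python 2 and Python 3
print(soup.text)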

Prints:

Some text
More text here
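If installing a third-party library is not an option, roughly the same idea can be sketched with the standard library alone. The following is a minimal sketch, not a drop-in replacement for BeautifulSoup: it assumes Python 3 (html.parser rather than Python 2's HTMLParser module), and the names VisibleTextExtractor and visible_text are made up for this illustration. It also skips style tags, which is an addition of mine and may or may not be what you want.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collect text nodes, ignoring anything inside <script> or <style>.

    Hypothetical helper for illustration; not part of any library.
    """

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements
        text = data.strip()
        if text and self._skip_depth == 0:
            self._chunks.append(text)

    def text(self):
        return "\n".join(self._chunks)


def visible_text(html):
    """Return the text of `html` with script and style contents dropped."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    sample = """
    <body>
        <p>Some text</p>
        <script>
            function(){ if (a < b) { run(); } }
        </script>
        <div>More text here</div>
    </body>
    """
    print(visible_text(sample))
```

The parser treats the body of a script element as raw text, so braces and comparison operators inside the JavaScript never reach the output. Real-world pages with badly broken markup are usually handled more gracefully by BeautifulSoup.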

And, to follow the tradition of html + regex posts, here is the relevant thread that explains why you should avoid using regular expressions for parsing things like HTML data:
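As a small illustration of how the regex approach breaks down on the question's own pattern, consider the sketch below. It assumes Python 3, and the sample strings are invented for the demonstration. Two things go wrong: the surrounding slashes are regex delimiters in other languages but literal characters in Python, and [^}]+ stops at the first closing brace, so nested blocks are cut in the wrong place.

```python
import re

# 1. With the slashes kept, the pattern only matches text like "/{...}/",
#    so an ordinary braced block is left untouched.
with_slashes = re.sub(r"/\{([^}]+)\}/", "", "a {b} c")
print(with_slashes)

# 2. Without the slashes, the pattern matches, but it ends at the FIRST
#    closing brace, which is wrong as soon as braces are nested.
braces = re.compile(r"\{([^}]+)\}")
source = "function(){ if (x) { y(); } z(); }"
stripped = braces.sub("", source)
print(stripped)
```

The second call leaves a stray "z(); }" behind instead of an empty function body, because regular expressions of this kind cannot keep track of nesting depth.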

alecxe