Regex for capturing everything between curly braces in HTML source

Question

I looked around at similar questions, but unfortunately found no solution.

I am currently trying to classify websites based on content, and to do that I am getting their HTML source and performing some kind of document/keyword classification on it.

Right now, I'm replacing a lot of stopwords, but I want to exclude things like function declarations in the HTML source as well. So for example:

function(){
        ... // function definition
}

I want to get rid of everything between the braces so it's just an empty declaration. I'm doing this in Python with the regex library, and tried the following:

htmlSource = re.sub('/\{([^}]+)\}/', '', htmlSource)

Unfortunately, this only seems to get rid of something that is directly surrounded by curly braces, as opposed to being enclosed by it.

I'm guessing that the regex also needs to account for an arbitrary amount of whitespace and any number of newlines, but I'm pretty inexperienced when it comes to regex.

Could anyone help?

By the way, I'm currently reading the HTML content using urllib2.urlopen().read(). If there is a better way of getting it (without non-visible JS function declarations and such), I would greatly appreciate hearing about that as well.

If you've "looked around at similar questions", I can't see how you can have failed to notice that they all say that parsing HTML with regex is a bad idea. — Daniel Roseman, Dec 03 '14 at 15:47
By that statement I meant that I had looked at regex questions which aimed to do something similar, regardless of their reasoning. Secondly, I am not storing the source in a database. Thirdly, why wouldn't I want to use something simple like `re.sub('<[^<]+?>', '', htmlSource)` to get rid of tags without needing any libraries? — filpa, Dec 03 '14 at 15:51
@DanielRoseman Nevermind my reply, I just read up on what you mean. Looks like I'll have to go the library route. — filpa, Dec 03 '14 at 15:55
@sln I'm not sure I understand your question. I'm using Python to extract the content from a given URL - the websites are not administered by me, I'm merely trying to perform document classification on their content. — filpa, Dec 03 '14 at 16:04
Don't they have robot crawlers to try to hack into somebody's website. I know they hacked mine. It's hard to tell what it is you are doing. — , Dec 03 '14 at 16:18

score 1 · Accepted Answer · edited May 23 '17 at 12:21

Use an HTML Parser to skip script tags.

For example, using BeautifulSoup you can extract() all script tags:

from bs4 import BeautifulSoup

data = """
<body>
    <p>Some text</p>
    <script>
        function(){
            ... // function definition
        }
    </script>
    <div>More text here</div>
</body>
"""

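# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(data, "html.parser")
for script in soup.find_all('script'):
    script.extract()  # detach the <script> element, including its contents

# print() with a single argument behaves the same on Python 2 and 3
print(soup.text)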

Prints (blank lines and indentation omitted):

Some text
More text here
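
If installing a third-party library is not an option (the question's comments mention wanting to avoid libraries), the same idea can be sketched with the standard library's html.parser module instead of BeautifulSoup. This is a minimal sketch, assuming Python 3; the names VisibleTextExtractor and visible_text are made up for illustration, and also skipping style elements is an extra assumption on top of what was asked.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collect text nodes, ignoring anything inside <script> or <style>."""

    # Hypothetical choice: <style> is skipped too, since CSS also uses braces.
    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self._chunks.append(text)

    def text(self):
        return "\n".join(self._chunks)


def visible_text(html):
    """Return the text of an HTML string without script/style contents."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = """
<body>
    <p>Some text</p>
    <script type="text/javascript">
        function f(){ if (x) { y(); } }
    </script>
    <div>More text here</div>
</body>
"""

print(visible_text(sample))
```

Because the parser tracks elements rather than braces, nested blocks inside the script make no difference to the result. For messy real-world markup, BeautifulSoup is likely to be more forgiving than this sketch.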

And, to follow the tradition of html + regex posts, here is the relevant thread that explains why you should avoid using regular expressions for parsing things like HTML data:

RegEx match open tags except XHTML self-contained tags
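
To make that concrete for the pattern in the question, here is a small sketch of two ways it goes wrong. The sample string is hypothetical, and the second pattern is one guess at what was intended, not a recommended fix.

```python
import re

# Hypothetical sample: a function body that itself contains a nested block.
source = "function f(){ if (x) { y(); } return 1; }"

# 1) The pattern from the question, unchanged apart from the raw-string prefix.
#    In Python the slashes are literal characters, not delimiters, so it only
#    matches braces that have a "/" directly before and after them.
with_slashes = re.sub(r'/\{([^}]+)\}/', '', source)

# 2) Without the slashes, the match ends at the first "}", so a nested
#    block leaves part of the function body behind.
without_slashes = re.sub(r'\{[^}]*\}', '{}', source)

print(with_slashes)
print(without_slashes)
```

The first call returns the input unchanged, and the second leaves " return 1; }" dangling after the emptied braces. Handling arbitrary nesting needs something that counts braces, which Python's re module cannot express in a single pattern.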

answered Dec 03 '14 at 15:49 by alecxe · edited May 23 '17 at 12:21 by Community

Would this work for all variations of `script` tags? For example, ` – filpa Dec 03 '14 at 15:54
@user991710 yup, it will. – alecxe Dec 03 '14 at 15:54
This worked out quite well. I'll try to avoid using regex for anything non-trivial related to HTML in the future. Thanks! – filpa Dec 03 '14 at 16:07
