2

I have a regular expression that should remove all content in a file before <div id="content"> and everything including and after <div id="footer">.

Live test

([\s\S]*)(?=<div id="content")|(?=<div id="footer)([\s\S]*)

I am using the re module to work with the regex in Python. The code I am using in my Python script:

file = open(file_dir)
content = file.read()
result = re.search('([\s\S]*)(?=<div id="content")|(?=<div id="footer)([\s\S]*))', content)

I have tried using re.match as well, but I am unable to return the content I want. Right now I can only get it to return everything BEFORE the div#content.

Brian Edelman
  • 727
  • 4
  • 10
  • 26

3 Answers

3

Though not advisable, you could extract your content instead of simply matching it:

import re

# Lazily skip everything before <div id="content", then capture from there up
# to (but not including) <div id="footer"; DOTALL lets . cross newlines and
# VERBOSE allows the pattern to be spread over several lines.
rx = re.compile(r'''
        .*?
        (
            <div\ id="content"
            .+?
        )
        <div\ id="footer
        ''', re.VERBOSE | re.DOTALL)

content = rx.findall(your_string_here)[0]  # your_string_here: the page's HTML as one string
print(content)


This yields
<div id="content" class="other">
i have this other stuff 
<div>More stuff</div>

See a demo on regex101.com. Better yet: use a parser, e.g. BeautifulSoup instead.
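
For completeness, the same idea also works with search(), which stops at the first hit instead of building a list; a minimal sketch, assuming file_dir is the path variable from the question:

import re

# One-line equivalent of the verbose pattern above.
rx = re.compile(r'.*?(<div id="content".+?)<div id="footer', re.DOTALL)

with open(file_dir) as fh:        # file_dir: the path used in the question
    match = rx.search(fh.read())

if match:
    print(match.group(1))         # everything from <div id="content" up to the footer div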

Jan
  • 42,290
  • 8
  • 54
  • 79
  • Agreed that it isn't advisable. However I have something like 40,000 pages to go through and I don't want it to take an eternity so my thinking is that regex would be faster than a parser. Would you agree? – Brian Edelman Jun 26 '17 at 18:06
  • @BrianEdelman: It mostly is, yes. And if you always have the same structure, it will very likely work. Bear in mind though that you might get unexpected results for e.g. comments or nested <div> elements. – Jan Jun 26 '17 at 18:07
  • 1
    Thanks for this answer. Marking it as correct since it's what I asked, though I ultimately am going for a parser. Clearly that way lies madness. – Brian Edelman Jun 26 '17 at 18:23
2

If you will permit me to comment: HTML + regex = madness. :)

HTML is often irregular and a few stray characters will derail the cleverest regex. Moreover, many web pages that appear to be HTML are actually not easily available as HTML. Meanwhile, there are several lovely products for processing websites that are undergoing continuous development, amongst them BeautifulSoup, selenium, and scrapy.

>>> from io import StringIO
>>> import bs4
>>> HTML = StringIO('''\
... <body>
...     <div id="container">
...         <div id="content">
...             <span class="something_1">some words</span>
...             <a href="https://link">big one</a>
...         </div>
...     <div>
...     <div id="footer">
... </body>''')
>>> soup = bs4.BeautifulSoup(HTML, 'lxml')
>>> soup.find('div', attrs={'id': 'container'})
<div id="container">
<div id="content">
<span class="something_1">some words</span>
<a href="https://link">big one</a>
</div>
<div>
<div id="footer">
</div></div></div>
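
For the question's actual goal (keeping only the content division), the same soup object can be queried directly; a small sketch continuing the session above, with content_div and kept_html as purely illustrative names:

>>> content_div = soup.find('div', attrs={'id': 'content'})
>>> kept_html = str(content_div)  # serialised HTML of just that <div>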
Bill Bell
  • 21,021
  • 5
  • 43
  • 58
  • Thanks for submitting! I ended up going with parser even if it will be a bit slower. I was able to make a similar code to above work without StringIO. What's the advantage of that? – Brian Edelman Jun 26 '17 at 18:22
  • 1
    The advantage of StringIO is only that I didn't have to create a file to offer an example. :) Also, the authors of scrapy say that their stuff is faster than BeautifulSoup's. You don't have to write a scraper to use it. – Bill Bell Jun 26 '17 at 18:28
  • Oh, and using StringIO, I could show the contents of the HTML right in the answer. – Bill Bell Jun 26 '17 at 18:29
1

This RegEx should work: https://regex101.com/r/L1zzOc/1

\<div id=\"content\"[.\s\S]*?(?=\<div id=\"footer\")

It looks like you had a typo in your original pattern and forgot the closing " after the first <div id="footer.
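
For reference, the pattern might be applied roughly like this; a minimal sketch, assuming file_dir and content keep the names from the question:

import re

# Match from <div id="content" up to, but not including, <div id="footer".
pattern = re.compile(r'<div id="content"[.\s\S]*?(?=<div id="footer")')

with open(file_dir) as fh:      # file_dir: the path from the question
    content = fh.read()

match = pattern.search(content)
if match:
    print(match.group(0))       # the part of the page to keep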

victor
  • 1,573
  • 11
  • 23