2

I have a regular expression that should remove all content in a file before <div id="content"> and everything including and after <div id="footer">.

Live test

([\s\S]*)(?=<div id="content")|(?=<div id="footer)([\s\S]*)

I am using the re module to work with the regex in Python. The code I am using in my Python script:

file = open(file_dir)
content = file.read()
result = re.search('([\s\S]*)(?=<div id="content")|(?=<div id="footer)([\s\S]*))', content)

I have tried using re.match as well, but I am unable to return the content I want. Right now I can only get it to return everything BEFORE the div#content.

Brian Edelman
  • 727
  • 4
  • 10
  • 26

3 Answers

3

Though not advisable, you could extract your content instead of simply matching it:

import re

# Lazily skip everything before <div id="content", then capture from there up
# to (but not including) <div id="footer"; DOTALL lets . cross newlines and
# VERBOSE allows the pattern to be spread over several lines.
rx = re.compile(r'''
        .*?
        (
            <div\ id="content"
            .+?
        )
        <div\ id="footer
        ''', re.VERBOSE | re.DOTALL)

content = rx.findall(your_string_here)[0]  # your_string_here: the page's HTML as one string
print(content)


This yields
<div id="content" class="other">
i have this other stuff 
<div>More stuff</div>

See a demo on regex101.com. Better yet: use a parser, e.g. BeautifulSoup instead.
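
For completeness, the same idea also works with search(), which stops at the first hit instead of building a list; a minimal sketch, assuming file_dir is the path variable from the question:

import re

# One-line equivalent of the verbose pattern above.
rx = re.compile(r'.*?(<div id="content".+?)<div id="footer', re.DOTALL)

with open(file_dir) as fh:        # file_dir: the path used in the question
    match = rx.search(fh.read())

if match:
    print(match.group(1))         # everything from <div id="content" up to the footer div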

Jan
  • 42,290
  • 8
  • 54
  • 79
  • Agreed that it isn't advisable. However I have something like 40,000 pages to go through and I don't want it to take an eternity so my thinking is that regex would be faster than a parser. Would you agree? – Brian Edelman Jun 26 '17 at 18:06
  • @BrianEdelman: It mostly is, yes. And if you always have the same structure, it will very likely work. Bear in mind though that you might get unexpected results for e.g. comments or nested <div> elements. – Jan Jun 26 '17 at 18:07
  • 1
    Thanks for this answer. Marking it as correct since it's what I asked, though I ultimately am going for a parser. Clearly that way lies madness. – Brian Edelman Jun 26 '17 at 18:23
2

If you will permit me to comment: HTML + regex = madness. :)

HTML is often irregular and a few stray characters will derail the cleverest regex. Moreover, many web pages that appear to be HTML are actually not easily available as HTML. Meanwhile, there are several lovely products for processing websites that are undergoing continuous development, amongst them BeautifulSoup, selenium, and scrapy.

>>> from io import StringIO
>>> import bs4
>>> HTML = StringIO('''\
... <body>
...     <div id="container">
...         <div id="content">
...             <span class="something_1">some words</span>
...             <a href="https://link">big one</a>
...         </div>
...     <div>
...     <div id="footer">
... </body>''')
>>> soup = bs4.BeautifulSoup(HTML, 'lxml')
>>> soup.find('div', attrs={'id': 'container'})
<div id="container">
<div id="content">
<span class="something_1">some words</span>
<a href="https://link">big one</a>
</div>
<div>
<div id="footer">
</div></div></div>
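
For the question's actual goal (keeping only the content division), the same soup object can be queried directly; a small sketch continuing the session above, with content_div and kept_html as purely illustrative names:

>>> content_div = soup.find('div', attrs={'id': 'content'})
>>> kept_html = str(content_div)  # serialised HTML of just that <div>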
Bill Bell
  • 21,021
  • 5
  • 43
  • 58
  • Thanks for submitting! I ended up going with parser even if it will be a bit slower. I was able to make a similar code to above work without StringIO. What's the advantage of that? – Brian Edelman Jun 26 '17 at 18:22
  • 1
    The advantage of StringIO is only that I didn't have to create a file to offer an example. :) Also, the authors of scrapy say that their stuff is faster than BeautifulSoup's. You don't have to write a scraper to use it. – Bill Bell Jun 26 '17 at 18:28
  • Oh, and using StringIO, I could show the contents of the HTML right in the answer. – Bill Bell Jun 26 '17 at 18:29
1

This RegEx should work: https://regex101.com/r/L1zzOc/1

\<div id=\"content\"[.\s\S]*?(?=\<div id=\"footer\")

It looks like you had a typo in your original pattern and forgot the closing " after the first <div id="footer.
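
For reference, the pattern might be applied roughly like this; a minimal sketch, assuming file_dir and content keep the names from the question:

import re

# Match from <div id="content" up to, but not including, <div id="footer".
pattern = re.compile(r'<div id="content"[.\s\S]*?(?=<div id="footer")')

with open(file_dir) as fh:      # file_dir: the path from the question
    content = fh.read()

match = pattern.search(content)
if match:
    print(match.group(0))       # the part of the page to keep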

victor
  • 1,573
  • 11
  • 23