Trying to use regex on this html tag

Question

I'm new to Python and have been having trouble with regex. I want to use a regex to grab only the page range (pp. 53-63), and to do the same for multiple similar lines throughout a website. Can anyone help me with this?

    <div class="src">
            Foreign Affairs, Vol. 79, No. 4 (Jul. - Aug., 2000), pp. 53-63
        </div>

So far, I've written:

    urlpage = page.read()
    outputh.write(urlpage)
    matches = re.findall(r'(<div class="src">+[\d+,\d]+\s+Search\s+Results)', urlpage)

But I know this is wrong.

Do we really need to point to the [regex answer](http://stackoverflow.com/a/1732454/104349) *again*? — Daniel Roseman, Sep 17 '16 at 19:08
@DanielRoseman: I guess so :( I have some boilerplate comments already. — Jan, Sep 17 '16 at 19:15
Why are you asking the same question again? You have been shown exactly how to do it reliably in your last question. — Padraic Cunningham, Sep 17 '16 at 19:48
@PadraicCunningham, I was just trying to look at the issue from multiple solutions. Sorry, I thought it was a different enough way of solving it. — Kainesplain, Sep 17 '16 at 20:14
@Kainesplain, regex is a really bad way to parse html for all the reasons listed in the answer linked to in Daniel's comment. Also your regex contains words that are not in your html tag text so it could not possibly work. If you want to parse html reliably use bs4 as per your last question. If that is not fast enough, look at *lxml*. — Padraic Cunningham, Sep 17 '16 at 23:32

Jan · Accepted Answer · 2016-09-17T19:44:41.390

Here you go:

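    import re

    from bs4 import BeautifulSoup

    data = """<div class="src">
            Foreign Affairs, Vol. 79, No. 4 (Jul. - Aug., 2000), pp. 53-63
        </div>"""

    # Name the parser explicitly so bs4 does not warn about guessing one.
    soup = BeautifulSoup(data, "html.parser")
    rx = re.compile(r'\bpp\. \d+-\d+')

    # find_all(string=rx) returns only the text nodes that match the pattern.
    pages = [rx.search(text).group(0)
             for text in soup.find_all(string=rx)]
    print(pages)  # ['pp. 53-63']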

This uses a parser (BeautifulSoup) together with a regular expression. The difference is that the regex is never run against the raw HTML: bs4 parses the markup first, and the regex is applied only to the text nodes it extracts.
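Since the question asks about several similar lines across a page, here is a minimal sketch of how the same idea might be restricted to the `div` elements with class `src`. The sample markup (apart from the first block), the function name `extract_page_ranges`, and the constant `PAGE_RANGE` are assumptions made for illustration, not something from the question.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample page; only the first block comes from the question.
html = """
<div class="src">
        Foreign Affairs, Vol. 79, No. 4 (Jul. - Aug., 2000), pp. 53-63
    </div>
<div class="src">
        Example Journal, Vol. 1, No. 2 (Jan., 2001), pp. 101-120
    </div>
<div class="other">pp. 1-2 (not a citation block)</div>
<div class="src">No page range here</div>
"""

PAGE_RANGE = re.compile(r'\bpp\. \d+-\d+')


def extract_page_ranges(markup):
    """Return the 'pp. N-M' string from every <div class="src"> that has one."""
    soup = BeautifulSoup(markup, "html.parser")
    ranges = []
    for div in soup.find_all("div", class_="src"):
        # Search only the text of this div, never the raw HTML.
        match = PAGE_RANGE.search(div.get_text())
        if match:  # skip citation blocks without a page range
            ranges.append(match.group(0))
    return ranges


pages = extract_page_ranges(html)
print(pages)  # ['pp. 53-63', 'pp. 101-120']
```

Filtering on the class first means a stray "pp." elsewhere on the page is ignored, and checking `match` before calling `.group(0)` avoids an `AttributeError` on blocks that have no page range.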
