Find a paragraph and find a string inside this paragraph with REGEX

Question

I have some lines like this inside an HTML page:

<div>
    <p class="match"> this sentence should match </p> 
    some text
    <a class="a"> some text </a>  
</div>
<div> 
    <p class="match"> this sentence shouldnt match</p> 
    some text
    <a class ="b"> some text </a> 
</div>

I want to extract the text inside the <p class="match"> elements, but only when they are inside a div containing <a class="a">.

What I've done so far is below (I first find the blocks containing <a class="a">, then iterate over the results to find the sentence inside a <p class="match">):

import re
file_to_r = open("a")

regex_div = re.compile(r'<div>.+"a".+?</div>', re.DOTALL)

regex_match = re.compile(r'<p class="match">(.+)</p>')
for m in regex_div.findall(file_to_r.read()):
    print(regex_match.findall(m))

but I wonder if there is another (still efficient) way to do it all at once?

Try beautiful soup 4 for parsing html files.. — Avinash Raj, Aug 28 '14 at 17:04
http://stackoverflow.com/a/1732454 — carloabelli, Aug 28 '14 at 17:04

score 3 · Answer 1 · edited May 23 '17 at 12:07

Use an HTML Parser, like BeautifulSoup.

Find the a tag with class a, then find its previous sibling p tag with class match:

from bs4 import BeautifulSoup

data = """
<div>
    <p class="match"> this sentence should match </p>
    some text
    <a class="a"> some text </a>
</div>
<div>
    <p class="match"> this sentence shouldn't match</p>
    some text
    <a class ="b"> some text </a>
</div>
"""

soup = BeautifulSoup(data, 'html.parser')  # name the parser explicitly
a = soup.find('a', class_='a')  # first <a> tag with class "a"
print(a.find_previous_sibling('p', class_='match').text)

Prints:

this sentence should match

Also see why you should avoid using regex for parsing HTML here:

RegEx match open tags except XHTML self-contained tags
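
The snippet above returns only the first match. If the page holds several divs with an <a class="a">, a loop over find_all covers them all. This is a minimal sketch, assuming bs4 is installed and using the standard library's html.parser backend; the names html and sentences are illustrative and not part of the original answer.

```python
from bs4 import BeautifulSoup

# Sample markup taken from the question.
html = """
<div>
    <p class="match"> this sentence should match </p>
    some text
    <a class="a"> some text </a>
</div>
<div>
    <p class="match"> this sentence shouldn't match</p>
    some text
    <a class ="b"> some text </a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

sentences = []
# Visit every <a class="a"> and look backwards among its siblings
# for a <p class="match"> in the same parent.
for a in soup.find_all("a", class_="a"):
    p = a.find_previous_sibling("p", class_="match")
    if p is not None:
        sentences.append(p.get_text(strip=True))

print(sentences)
```

The None check skips any <a class="a"> that has no matching paragraph before it, so the loop does not raise on irregular markup.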

@user3683807 please read the linked thread carefully: HTML parsers are made specifically for parsing HTML, a specific tool for a particular task. I would recommend using `BeautifulSoup` here; it makes HTML parsing easy and reliable. — alecxe, Aug 28 '14 at 17:19

score 1 · Accepted Answer · answered Aug 28 '14 at 17:20

You should use an HTML parser, but if you still want a regex you can use something like this:

<div>\s*<p class="match">([\w\s]+)</p>[\w\s]+(?=<a class="a").*?</div>
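
A sketch of how this pattern might be applied from Python, using the sample HTML from the question. It assumes re.DOTALL is needed, because the trailing .*? has to cross the newline between </a> and </div>; the variable names are illustrative.

```python
import re

# Sample markup taken from the question.
html = """
<div>
    <p class="match"> this sentence should match </p>
    some text
    <a class="a"> some text </a>
</div>
<div>
    <p class="match"> this sentence shouldnt match</p>
    some text
    <a class ="b"> some text </a>
</div>
"""

# re.DOTALL lets "." match newlines, which the final ".*?</div>" relies on.
pattern = re.compile(
    r'<div>\s*<p class="match">([\w\s]+)</p>[\w\s]+(?=<a class="a").*?</div>',
    re.DOTALL,
)

# The pattern has one capture group, so findall returns a list of strings.
matches = [text.strip() for text in pattern.findall(html)]
print(matches)
```

Note that [\w\s]+ only accepts word characters and whitespace, so a sentence containing punctuation (an apostrophe, a comma) would not be captured by this pattern.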


Federico Piazza

@Jerry, as I said in my answer, I wouldn't use a regex to parse HTML, but I posted it as an option that answers the question with a regex. – Federico Piazza Aug 28 '14 at 17:30

score 1 · Answer 3 · answered Aug 28 '14 at 17:34

<div>\s*\n\s*.*?<p class=.*?>(.*?)<\/p>\s*\n\s*.*?\s*\n\s*(?=(\<a class=\"a\"\>))

You can use this. See the demo:
http://regex101.com/r/lK9iD2/7
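
A sketch of using this pattern from Python, on the sample HTML from the question. The lookahead contains a second capture group, so findall would return tuples; the sketch uses finditer and reads only group 1. The variable names are illustrative.

```python
import re

# Sample markup taken from the question.
html = """
<div>
    <p class="match"> this sentence should match </p>
    some text
    <a class="a"> some text </a>
</div>
<div>
    <p class="match"> this sentence shouldnt match</p>
    some text
    <a class ="b"> some text </a>
</div>
"""

# No flags: "." stays on one line, and the explicit \n tokens
# walk through the line breaks between <div>, <p>, the text, and <a>.
pattern = re.compile(
    r'<div>\s*\n\s*.*?<p class=.*?>(.*?)<\/p>\s*\n\s*.*?\s*\n\s*(?=(\<a class=\"a\"\>))'
)

# Group 1 is the paragraph text; group 2 is the <a class="a"> seen by the lookahead.
matches = [m.group(1).strip() for m in pattern.finditer(html)]
print(matches)
```

This pattern depends on the exact line layout (one element per line), so it is likely to miss divs that are formatted differently.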

vks
