I have some markup like this inside an HTML page:

<div>
    <p class="match"> this sentence should match </p> 
    some text
    <a class="a"> some text </a>  
</div>
<div> 
    <p class="match"> this sentence shouldnt match</p> 
    some text
    <a class ="b"> some text </a> 
</div>

I want to extract the text inside each <p class="match">, but only when it sits inside a div that also contains <a class="a">.

What I've done so far is below: I first find the divs that contain <a class="a">, then iterate over the results to pull out the sentence inside <p class="match">:

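import re

# Read the whole page up front so the file is closed promptly.
with open("a") as file_to_r:
    html = file_to_r.read()

# Note: the greedy .+ can run across several <div> blocks.
regex_div = re.compile(r'<div>.+"a".+?</div>', re.DOTALL)
regex_match = re.compile(r'<p class="match">(.+)</p>')

for m in regex_div.findall(html):
    print(regex_match.findall(m))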

Is there another (still efficient) way to do this in a single step?

Simon

3 Answers

Use an HTML Parser, like BeautifulSoup.

Find the a tag with class "a", then find its previous sibling p tag with class "match":

from bs4 import BeautifulSoup

data = """
<div>
    <p class="match"> this sentence should match </p>
    some text
    <a class="a"> some text </a>
</div>
<div>
    <p class="match"> this sentence shouldn't match</p>
    some text
    <a class ="b"> some text </a>
</div>
"""

# Name the parser explicitly so the result does not depend on what is installed.
soup = BeautifulSoup(data, "html.parser")
a = soup.find('a', class_='a')
print(a.find_previous_sibling('p', class_='match').text)

Prints:

this sentence should match 
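If installing BeautifulSoup is not an option, the same idea can be sketched with the standard library's `html.parser`. This is a swapped-in technique, not the code from this answer: the class name `MatchExtractor` and its attributes are hypothetical, and it assumes the `<a class="a">` counts for the innermost enclosing div only.

```python
from html.parser import HTMLParser


class MatchExtractor(HTMLParser):
    """Collect the text of <p class="match"> from divs that also hold <a class="a">."""

    def __init__(self):
        super().__init__()
        self.results = []         # sentences from qualifying divs
        self._div_stack = []      # one record per currently open <div>
        self._in_match_p = False  # True while inside <p class="match">
        self._buffer = []         # text pieces of the current paragraph

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div":
            self._div_stack.append({"texts": [], "has_a": False})
        elif not self._div_stack:
            return  # ignore anything outside a div
        elif tag == "p" and "match" in classes:
            self._in_match_p = True
            self._buffer = []
        elif tag == "a" and "a" in classes:
            self._div_stack[-1]["has_a"] = True

    def handle_data(self, data):
        if self._in_match_p:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._in_match_p:
            self._in_match_p = False
            if self._div_stack:
                text = "".join(self._buffer).strip()
                self._div_stack[-1]["texts"].append(text)
        elif tag == "div" and self._div_stack:
            div = self._div_stack.pop()
            # Keep the paragraphs only if this div contained <a class="a">.
            if div["has_a"]:
                self.results.extend(div["texts"])


data = """
<div>
    <p class="match"> this sentence should match </p>
    some text
    <a class="a"> some text </a>
</div>
<div>
    <p class="match"> this sentence shouldn't match</p>
    some text
    <a class ="b"> some text </a>
</div>
"""

parser = MatchExtractor()
parser.feed(data)
parser.close()
print(parser.results)
```

Unlike the `find` call above, this collects every qualifying paragraph in the document rather than only the first one, and it strips the surrounding whitespace from each sentence.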

Also see why you should avoid using regex for parsing HTML here:

alecxe
  • @user3683807 please read the linked thread carefully. HTML parsers are made specifically for parsing HTML: a dedicated tool for a particular task. I would recommend using `BeautifulSoup` here; it makes HTML parsing easy and reliable. – alecxe Aug 28 '14 at 17:19

You should use an HTML parser, but if you still want a regex you can use something like this:

<div>\s*<p class="match">([\w\s]+)</p>[\w\s]+(?=<a class="a").*?</div>

Working demo
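A minimal sketch of how this pattern might be used from Python, run against the sample from the question. The `re.DOTALL` flag is an assumption added here: without it, `.` cannot cross the newline that sits between `</a>` and `</div>` in the sample, so the trailing `.*?</div>` would not match. The variable names are illustrative only.

```python
import re

html_doc = """
<div>
    <p class="match"> this sentence should match </p>
    some text
    <a class="a"> some text </a>
</div>
<div>
    <p class="match"> this sentence shouldnt match</p>
    some text
    <a class ="b"> some text </a>
</div>
"""

# Pattern from the answer above; DOTALL lets ".*?" reach the closing </div>
# on the following line (an assumption, the answer does not state its flags).
pattern = re.compile(
    r'<div>\s*<p class="match">([\w\s]+)</p>[\w\s]+(?=<a class="a").*?</div>',
    re.DOTALL,
)

# findall returns the single capture group: the paragraph text.
sentences = [text.strip() for text in pattern.findall(html_doc)]
print(sentences)
```

Note that `[\w\s]+` only accepts word characters and whitespace, so a sentence containing punctuation (or any nested tag between the paragraph and the link) would not match.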

Federico Piazza
  • @Jerry as I suggested in my answer, I wouldn't use a regex to parse HTML, but I posted it as an option that answers the question using a regex. – Federico Piazza Aug 28 '14 at 17:30

<div>\s*\n\s*.*?<p class=.*?>(.*?)<\/p>\s*\n\s*.*?\s*\n\s*(?=(\<a class=\"a\"\>))

You can use this.

See demo.

http://regex101.com/r/lK9iD2/7
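For reference, a small sketch of applying this pattern in Python, again on the sample from the question. One thing to watch: the pattern has two capture groups (the paragraph text and the `<a class="a">` inside the lookahead), so `re.findall` returns tuples; reading `group(1)` from `finditer` is one way to keep only the sentence. Variable names here are illustrative.

```python
import re

page = """
<div>
    <p class="match"> this sentence should match </p>
    some text
    <a class="a"> some text </a>
</div>
<div>
    <p class="match"> this sentence shouldnt match</p>
    some text
    <a class ="b"> some text </a>
</div>
"""

# Pattern from the answer above, used without extra flags: "." stays on one
# line, and the explicit \n pieces step from line to line.
line_pattern = re.compile(
    r'<div>\s*\n\s*.*?<p class=.*?>(.*?)<\/p>\s*\n\s*.*?\s*\n\s*(?=(\<a class=\"a\"\>))'
)

# group(1) is the paragraph text; group(2) is the looked-ahead <a> tag.
texts = [m.group(1).strip() for m in line_pattern.finditer(page)]
print(texts)
```

Because the pattern spells out each line break, it depends on the exact line layout of the sample (paragraph, one line of text, then the link).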

vks