I have some lines like these inside an HTML page:
<div>
<p class="match"> this sentence should match </p>
some text
<a class="a"> some text </a>
</div>
<div>
<p class="match"> this sentence shouldnt match</p>
some text
<a class ="b"> some text </a>
</div>
I want to extract the text inside each <p class="match">, but only when it is inside a div that also contains <a class="a">.
What I've done so far is below (I first find the divs that contain <a class="a">, then iterate over the results to find the sentence inside each <p class="match">):
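import re

# Step 1: find each div containing "a"; step 2: search inside it for the paragraph.
regex_div = re.compile(r'<div>.+"a".+?</div>', re.DOTALL)
regex_match = re.compile(r'<p class="match">(.+)</p>')

with open("a") as file_to_r:  # the file is closed automatically
    for m in regex_div.findall(file_to_r.read()):
        print(regex_match.findall(m))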
However, I wonder whether there is another (still efficient) way to do it in a single step?
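To make the "single step" idea concrete, here is a minimal sketch of what one combined pattern might look like. It is only an illustration under several assumptions: the divs are not nested, the <a class="a"> tag comes after the <p class="match"> inside the same div, and the attributes are written exactly as in the sample. The function name extract_matches is hypothetical, and for real-world HTML a proper parser is likely more reliable than any regex.

```python
import re

# Sample input copied from the lines above.
html = """<div>
<p class="match"> this sentence should match </p>
some text
<a class="a"> some text </a>
</div>
<div>
<p class="match"> this sentence shouldnt match</p>
some text
<a class ="b"> some text </a>
</div>"""

# (?:(?!X).)* consumes characters one at a time without ever crossing X.
# It is used twice: to keep the capture inside one <p>, and to keep the
# lookahead for <a class="a"> inside the current div.
SINGLE_PASS = re.compile(
    r'<p class="match">((?:(?!</p>).)*)</p>'   # capture the paragraph text
    r'(?=(?:(?!</div>).)*<a class="a">)',      # <a class="a"> must appear before the next </div>
    re.DOTALL,
)

def extract_matches(text):
    """Return the stripped paragraph texts found by the combined pattern."""
    return [m.strip() for m in SINGLE_PASS.findall(text)]

print(extract_matches(html))  # ['this sentence should match']
```

The lookahead only checks for the anchor tag and does not consume it, so several matching paragraphs inside the same div would each be reported.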