I've already found a lot of Stack Overflow questions about this topic, but I cannot work out a solution to my problem from them.

I have the following HTML:

<p><a name="first-title"></a></p>
<h3>First Title</h3>
<h2><a href='#second'>Second Title</a></h2>
<h3>Third Title</h3>

I want to find the <h3> that is preceded by </a></p>. In this case, the output should be:

<h3>First Title</h3>

So I wrote the following regular expression:

preg_match_all('/(?<=<\/a><\/p>)<h3>(.+?)<\/h3>/s',$html,$data);

This regular expression matches nothing in the HTML above. But if I remove the newlines from the HTML, it correctly returns the desired result.

I would prefer not to remove the newlines from the HTML if possible. How should I write the regular expression so that it ignores the newlines in the source string?

Please, help me.

Steve.NayLinAung
  • Read this http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454. Regexes are NOT the way to parse HTML – Jojodmo Jun 28 '15 at 22:00
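
Following up on the comment above, here is a minimal sketch of a DOM-based alternative, assuming PHP's DOM extension is available. The XPath expression is my own assumption about what "preceded by </a></p>" should mean: it selects an <h3> that is the first element sibling after a <p> containing an <a>. Whitespace between the tags is irrelevant to this approach.

```php
<?php
// The HTML from the question, newlines included.
$html = <<<HTML
<p><a name="first-title"></a></p>
<h3>First Title</h3>
<h2><a href='#second'>Second Title</a></h2>
<h3>Third Title</h3>
HTML;

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);

$xpath = new DOMXPath($doc);

// Assumption: "preceded by </a></p>" means the <h3> is the first
// element sibling following a <p> that has an <a> child.
$nodes = $xpath->query('//p[a]/following-sibling::*[1][self::h3]');

$titles = [];
foreach ($nodes as $node) {
    // Serialize the matched element back to HTML.
    $titles[] = $doc->saveHTML($node);
}

print_r($titles);
```

The third title is skipped because its nearest preceding element is an <h2>, not a <p>.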

1 Answer

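This is a good use case for \K, because a lookbehind assertion in PCRE cannot contain an unbounded quantifier such as \s*. \K resets the start of the reported match, so everything matched before it (here </a></p> and any whitespace that follows) is left out of the result.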

preg_match_all('/<\/a><\/p>\s*\K<h3>(.+?)<\/h3>/s',$html,$data);

Alternatively, put a literal \n inside the lookbehind. This works only when exactly one \n separates </a></p> from <h3>:

preg_match_all('/(?<=<\/a><\/p>\n)<h3>(.+?)<\/h3>/s',$html,$data);
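
To compare the two patterns, here is a small sketch that runs both against the HTML from the question. The variable names ($viaK, $viaLookbehind, $crlf and so on) are only illustrative, and the Windows-style line-ending case is an assumption added to show where the approaches differ.

```php
<?php
// The HTML from the question, with "\n" line endings.
$html = <<<HTML
<p><a name="first-title"></a></p>
<h3>First Title</h3>
<h2><a href='#second'>Second Title</a></h2>
<h3>Third Title</h3>
HTML;

$withK          = '/<\/a><\/p>\s*\K<h3>(.+?)<\/h3>/s';
$withLookbehind = '/(?<=<\/a><\/p>\n)<h3>(.+?)<\/h3>/s';

// Both patterns find the first <h3> when the separator is a single "\n".
preg_match_all($withK, $html, $viaK);
preg_match_all($withLookbehind, $html, $viaLookbehind);

// Assumed extra case: the same HTML with "\r\n" line endings.
$crlf = str_replace("\n", "\r\n", $html);
preg_match_all($withK, $crlf, $viaKCrlf);
preg_match_all($withLookbehind, $crlf, $viaLookbehindCrlf);

print_r($viaK[0]);              // full matches from the \K pattern
print_r($viaK[1]);              // captured title text
print_r($viaLookbehindCrlf[0]); // empty: "\r" sits between </p> and "\n"
```

The \s* version tolerates any amount or kind of whitespace, while the fixed \n lookbehind stops matching as soon as the separator is anything other than a single newline.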
Avinash Raj