
I must be overcomplicating this, but I can't figure it out for the life of me.

I have a standard HTML document stored as a string, and I need to get the contents of each paragraph. Here is an example case.

$stringHTML=
"<html>

<head>
<title>Title</title>
</head>

<body>

<p>This is the first paragraph</p>
<p>This is the second</p>
<p>This is the third</p>
<p>And fourth</p>

</body>
</html>";

If I use

$regex='~(<p>)(.*)(</p>)~i';
preg_match_all($regex, $stringHTML, $newVariable); 

I won't get 4 results. Rather, I'll get 10. I get 10 because the regex matches the first <p> and first </p> as well as the first <p> and fourth </p>.

How can I search between two words and return only what's between each pair of paragraph tags?

3 Answers


Use an HTML parser such as DOMDocument or DOMXPath to parse HTML. Don't use regex to parse HTML. Here is how the string can be parsed with DOMDocument.

// Parse the HTML string into a DOM tree.
$doc = new \DOMDocument;
$doc->loadHTML($stringHTML);

// Print the text content of every <p> element.
foreach ($doc->getElementsByTagName("p") as $p) {
    echo $p->textContent . "\n";
}
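
Since XPath is mentioned as an alternative, here is a minimal sketch of the same extraction with DOMXPath. The shortened sample string, the `$paragraphs` variable, and the `//p` query are assumptions made for illustration and are not part of the original answer.

```php
<?php
// Shortened stand-in for the question's $stringHTML (assumption for illustration).
$stringHTML = "<html><head><title>Title</title></head><body>
<p>This is the first paragraph</p>
<p>This is the second</p>
</body></html>";

$doc = new DOMDocument();
$doc->loadHTML($stringHTML);

// Select every <p> element anywhere in the document.
$xpath = new DOMXPath($doc);
$paragraphs = [];
foreach ($xpath->query('//p') as $p) {
    $paragraphs[] = $p->textContent;
}

print_r($paragraphs);
```

An XPath query becomes more useful than getElementsByTagName once the selection needs a condition, for example only paragraphs inside a particular container.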



Using this regex (as you said it's regex practice) you'll get 4 results.

preg_match_all("#<p>(.*)</p>#", $stringHTML, $matches);
print_r($matches[1]);

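This works on the sample because . does not match newlines by default, so each match stays on a single line. The paragraph text is captured in group 1, which is why $matches[1] is printed.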

Shiplu Mokaddim

Use .*? to get the shortest match instead of the longest match.
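
To illustrate the difference, here is a small sketch comparing the greedy and lazy quantifiers. The single-line `$html` string is a hypothetical input, chosen so that a greedy match can run across both paragraphs.

```php
<?php
// Hypothetical single-line input (not the question's sample).
$html = '<p>one</p><p>two</p>';

// Greedy: .* extends as far as possible, up to the last </p>.
preg_match_all('~<p>(.*)</p>~', $html, $greedy);

// Lazy: .*? stops at the first </p> it can reach.
preg_match_all('~<p>(.*?)</p>~', $html, $lazy);

print_r($greedy[1]); // one element: "one</p><p>two"
print_r($lazy[1]);   // two elements: "one" and "two"
```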

Barmar
  • And this will yield 4 results? –  Jan 01 '13 at 04:53
  • Why don't you "give it a go"! – happy coder Jan 01 '13 at 04:59
  • array(4) { [0]=> array(4) { [0]=> string(34) " This is the first paragraph " [1]=> string(25) " This is the second " [2]=> string(24) " This is the third " [3]=> string(17) " And fourth " } [1]=> array(4) { [0]=> string(3) " " [1]=> string(3) " " [2]=> string(3) " " [3]=> string(3) " " } [2]=> array(4) { [0]=> string(27) "This is the first paragraph" [1]=> string(18) "This is the second" [2]=> string(17) "This is the third" [3]=> string(10) "And fourth" } [3]=> array(4) { [0]=> string(4) " " [1]=> string(4) " " [2]=> string(4) " " [3]=> string(4) " " } } –  Jan 01 '13 at 05:09
  • ... Not what I wanted, but still easier to chop up. –  Jan 01 '13 at 05:10
  • You seem to be printing it out as HTML, so you're not seeing the tags that were matched by `(<p>)` and `(</p>)`. – Barmar Jan 01 '13 at 05:20

Your regex should be /<p>(.*?)<\/p>/i . It will match only the strings between <p> and </p> and put them in an array.

You don't need a capturing group around the tag itself, as in (<p>).
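
As a quick check, here is a sketch that applies this pattern to the paragraphs from the question. The sample string is trimmed to the body for brevity, which is an assumption made for illustration.

```php
<?php
// Body portion of the question's sample (trimmed for brevity).
$stringHTML = "<body>
<p>This is the first paragraph</p>
<p>This is the second</p>
<p>This is the third</p>
<p>And fourth</p>
</body>";

// Only the paragraph text is captured, so $matches[1] holds the results.
preg_match_all('/<p>(.*?)<\/p>/i', $stringHTML, $matches);

echo count($matches[1]), "\n";
print_r($matches[1]);
```

If a paragraph can span several lines, the s modifier would also be needed so that . matches newlines.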

revo