This question has been asked many times, but the snippets I found didn't work well. I have little experience with regex, so I hope you can help me.

I want to extract only the first N paragraphs from an HTML string. I know I can collect matches with preg_match_all.

I am struggling with two things:

  • The paragraphs are generated by an HTML editor, so the <p> tags sometimes carry attributes
  • If possible, I want the <p> tags included in the result, but getting only the text is fine too

For example:

<p>Paragraph 1</p>
<p attribute="value">Paragraph 2</p>

With a limit of 1, I want only the first paragraph. With a limit of 2, paragraph 2 should be returned as well, even though it has attributes.

What I tried:

function GetParagraph($content, $limitParagraph = 1)
{
    preg_match_all('~(<p>(.+?)</p>){' . (int)$limitParagraph. '}~i', $sHTML, $aMatches);
    return $aMatches[0];
}

A regex with '~(<p(.*?)>(.+?)</p>){' . (int)$limitParagraph. '}~i' didn't work well either.

eL-Prova
  • You should consider reading **[this](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454)** before committing too many resources to regex-based HTML parsers. – YvesLeBorg Apr 08 '18 at 12:39
  • I am aware of it, but to keep it simple, returning just the two paragraphs without the tags would be enough :-) – eL-Prova Apr 08 '18 at 12:42
  • This is called parsing. Don't use Regular Expressions for parsing HTML documents. Use a DOM parser instead. – revo Apr 08 '18 at 12:45

1 Answer

You don't need regular expressions for this kind of task, and you shouldn't use them. This is HTML parsing, and it should be done with the right tool: a parser. In PHP, DOMDocument and DOMXPath are the tools of choice:

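$dom = new DOMDocument();
// Collect libxml warnings about imperfect markup instead of printing them
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_use_internal_errors(false);
// Every <p> element in document order, with or without attributes
$paragraphs = $dom->getElementsByTagName('p');
foreach ($paragraphs as $i => $p) {
    // Two paragraphs only
    if ($i >= 2) break;
    // Print the paragraph including its <p> tag and attributes
    echo $dom->saveHTML($p);
}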
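To connect this with the function from the question, the loop can be wrapped so that the limit becomes a parameter. This is only a minimal sketch: the name getParagraphs and its signature are made up for illustration, and it assumes $html is a non-empty HTML string.

```php
<?php
// Hypothetical helper (not part of the original answer): returns the outer
// HTML of the first $limit <p> elements found in $html.
function getParagraphs(string $html, int $limit = 1): array
{
    $dom = new DOMDocument();
    // Suppress warnings about imperfect markup while loading,
    // then restore the previous libxml error setting.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $result = [];
    foreach ($dom->getElementsByTagName('p') as $p) {
        if (count($result) >= $limit) {
            break;
        }
        // saveHTML($node) serializes the element including its tag and attributes
        $result[] = $dom->saveHTML($p);
    }
    return $result;
}

$html = "<p>Paragraph 1</p>\n<p attribute=\"value\">Paragraph 2</p>";
print_r(getParagraphs($html, 1)); // first paragraph only
print_r(getParagraphs($html, 2)); // both paragraphs, attributes included
```

If only the text is needed, replacing $dom->saveHTML($p) with $p->textContent should return the paragraph content without the surrounding tag.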

Why doesn't your regex work?

There are four reasons:

  1. It doesn't allow for the newline after each </p>, so the repeated group cannot match consecutive paragraphs that sit on separate lines.
  2. The variable holding the HTML content is wrong ($sHTML instead of $content).
  3. It isn't anchored, so matching is not restricted to starting at the beginning of the input string.
  4. <p> matches only the literal <p>, not <p attribute="value"> or any other variant.
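Reasons 1 and 4 can be checked against the sample input from the question. The snippet below is an illustrative sketch with the limit hard-coded to 2. The third pattern is not from the question; it only adds \s* after </p> to show the effect of allowing whitespace between paragraphs.

```php
<?php
$content = "<p>Paragraph 1</p>\n<p attribute=\"value\">Paragraph 2</p>";

// The question's first pattern: a literal <p> and no whitespace allowed
// between repetitions.
$strict = preg_match_all('~(<p>(.+?)</p>){2}~i', $content, $m1);

// The question's second pattern: attributes are allowed, but the newline
// between the two paragraphs still breaks the repetition.
$withAttributes = preg_match_all('~(<p(.*?)>(.+?)</p>){2}~i', $content, $m2);

// Illustrative variant: allowing whitespace after each </p> lets the
// group repeat across lines.
$withWhitespace = preg_match_all('~(<p(.*?)>(.+?)</p>\s*){2}~i', $content, $m3);

var_dump($strict, $withAttributes, $withWhitespace);
```

On this input the first two patterns find no match at all, while the variant that tolerates whitespace finds one match spanning both paragraphs.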

Again, this is not recommended, but to answer the question specifically, the regex below should solve these issues:

'~^.*?(?:<p[^>]*>.+?</p>\s*){' . $limitParagraph . '}~i' 
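As a rough sketch of how this pattern could be used, it can be dropped into a small function with preg_match. The function name getParagraphsRegex, the null return on failure and the trim() call are assumptions made for illustration, not part of the original answer.

```php
<?php
// Hypothetical wrapper around the pattern above. Returns the matched text
// (any content before the first <p> plus the first $limit paragraphs),
// or null when fewer than $limit paragraphs can be matched.
function getParagraphsRegex(string $content, int $limit = 1): ?string
{
    $pattern = '~^.*?(?:<p[^>]*>.+?</p>\s*){' . $limit . '}~i';
    if (preg_match($pattern, $content, $matches) === 1) {
        // The trailing \s* may capture a newline, so trim the result
        return trim($matches[0]);
    }
    return null;
}

$content = "<p>Paragraph 1</p>\n<p attribute=\"value\">Paragraph 2</p>";
echo getParagraphsRegex($content, 1), PHP_EOL; // <p>Paragraph 1</p>
echo getParagraphsRegex($content, 2), PHP_EOL; // both paragraphs, separated by the newline
```

Two caveats apply to the pattern as written. Without the s modifier, the dot does not match newlines, so a paragraph whose content spans several lines will not match. And <p[^>]*> also matches other tags that begin with "p", such as <pre>.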
revo
  • Your explanation is clear; $sHTML was a copy mistake. Your solutions point me in the right direction. Thanks! – eL-Prova Apr 10 '18 at 19:01