Regex for limit paragraphs

Question

This question has been asked several times, but the snippets I found didn't work well. I have little experience with regex, so I hope you can help me.

I want to get a limited number of paragraphs. I know I can limit my results with preg_match_all.

I have two struggles:

Paragraphs are "created" by an HTML editor, so attributes are sometimes attached
If possible, I want the <p> tags too, but getting only the text is fine as well

For example:

<p>Paragraph 1</p>
<p attribute="value">Paragraph 2</p>

When I limit to one, I want only the first paragraph, but a limit of 2 should return paragraph 2 as well, even if it has attributes.

What I tried:

function GetParagraph($content, $limitParagraph = 1)
{
    preg_match_all('~(<p>(.+?)</p>){' . (int)$limitParagraph. '}~i', $sHTML, $aMatches);
    return $aMatches[0];
}

Also regex with '~(<p(.*?)>(.+?)){' . (int)$limitParagraph. '}~i' didn't work well

You should consider reading **[this](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454)** before committing too many resources to regex-based HTML parsers. — YvesLeBorg, Apr 08 '18 at 12:39
I am aware of it, but to keep it simpler, just returning the two paragraphs without the tags would be fine :-) — eL-Prova, Apr 08 '18 at 12:42
This is called parsing. Don't use Regular Expressions for parsing HTML documents. Use a DOM parser instead. — revo, Apr 08 '18 at 12:45

revo · Accepted Answer · 2018-04-08T13:12:14.607

You do not need, and should not use, regular expressions for this kind of task. This is HTML parsing, and it should be done with the right tool: a parser. In PHP, DOMDocument along with DOMXPath would be your choices:

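$dom = new DOMDocument();
// Collect libxml warnings about imperfect markup instead of emitting them
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_use_internal_errors(false);
// All <p> elements in document order, with or without attributes
$paragraphs = $dom->getElementsByTagName('p');
foreach ($paragraphs as $i => $p) {
    // Two paragraphs only
    if ($i >= 2) break;
    // Print the paragraph including its tags
    echo $dom->saveHTML($p);
}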
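
The loop above can be wrapped in a function shaped like the question's GetParagraph. This is only a minimal sketch, not code from the answer: the name getParagraphs, the $asText flag, and the XPath expression are assumptions made for illustration, and it assumes PHP 7 or later with the DOM extension enabled. It uses DOMXPath to select the first N p elements in document order.

```php
// Hypothetical helper: returns the first $limit <p> elements of $html,
// either as HTML strings (tags and attributes kept) or as plain text.
function getParagraphs(string $html, int $limit = 1, bool $asText = false): array
{
    $dom = new DOMDocument();
    // Suppress warnings about imperfect markup, then restore the old setting.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // (//p)[position() <= N] keeps the first N <p> nodes of the whole document.
    $nodes = $xpath->query('(//p)[position() <= ' . max(0, $limit) . ']');

    $result = [];
    foreach ($nodes as $p) {
        // textContent drops the tags; saveHTML($p) keeps them.
        $result[] = $asText ? $p->textContent : $dom->saveHTML($p);
    }
    return $result;
}

$html = "<p>Paragraph 1</p>\n<p attribute=\"value\">Paragraph 2</p>";
print_r(getParagraphs($html, 2));       // both paragraphs, tags included
print_r(getParagraphs($html, 1, true)); // text of the first paragraph only
```

Passing true as the third argument covers the case where only the text is wanted; leaving it out keeps the p tags and their attributes.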

Why doesn't your RegEx work?

For four reasons:

It doesn't allow for newlines after each </p>.
The variable that holds the HTML content is wrong ($sHTML instead of $content).
It isn't anchored, so matching is not restricted to starting at the beginning of the input string.
<p> doesn't match <p attribute="value"> or anything other than itself.

Again, this is not recommended, but to answer the question specifically, the regex below should solve these issues:

'~^.*?(?:<p[^>]*>.+?</p>\s*){' . $limitParagraph . '}~i'
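
For completeness, here is how that pattern might be dropped into the question's function. Treat it as a sketch under assumptions: the (int) cast, the switch from preg_match_all to preg_match, and the null return are my additions, not part of the pattern above.

```php
// Hypothetical rewrite of the question's GetParagraph() using the pattern above.
function GetParagraph($content, $limitParagraph = 1)
{
    $pattern = '~^.*?(?:<p[^>]*>.+?</p>\s*){' . (int)$limitParagraph . '}~i';
    // preg_match() is enough: the anchored pattern can match at most once.
    if (!preg_match($pattern, $content, $aMatches)) {
        return null; // fewer than $limitParagraph paragraphs were found
    }
    // One string holding the first N paragraphs, plus any trailing whitespace.
    return $aMatches[0];
}

$content = "<p>Paragraph 1</p>\n<p attribute=\"value\">Paragraph 2</p>";
var_dump(GetParagraph($content, 1)); // first paragraph and the newline after it
var_dump(GetParagraph($content, 2)); // both paragraphs
var_dump(GetParagraph($content, 3)); // NULL, only two paragraphs exist
```

Two limits are worth knowing. Without the s modifier, the dot does not match newlines, so a paragraph whose text spans several lines is not matched. Also, <p[^>]*> accepts any tag whose name starts with p, such as <pre>. Both are further reasons to prefer the DOM approach.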

Your explanation is clear; $sHTML was a copy mistake. Your solution(s) also pointed me in the right direction. Thx! — eL-Prova, Apr 10 '18 at 19:01
