Extracting the first paragraph of an article using PHP

Question

I want to extract the first paragraph of an article using RegEx and PHP. I started to write a RegEx as below:

'/<p([^>]+)>(.*)<\/p>/i'

That does the job, but there is one small bug: when the markup is minified onto a single line, as below:

<p>First Paragraph</p><p>SecondParagraph</p>

It simply matches everything, i.e. both paragraphs together: First ParagraphSecondParagraph.
Also, I know that a paragraph cannot be nested inside another one, but I have no control over what users write, so someone may do something like this, and the RegEx returns an unexpected result in this case:

<p>
    First Paragraph
    <p>SecondParagraph</p>
</p>

Now the RegEx matches First ParagraphSecondParagraph but should extract First ParagraphSecondParagraph.
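
To illustrate the behavior described above, here is a minimal sketch comparing a greedy and a lazy quantifier on the minified markup. It assumes a slightly adjusted pattern: the original `[^>]+` requires at least one character after `<p`, so a bare `<p>` would not match, and `[^>]*` is used here instead. The variable names are illustrative only.

```php
<?php
// Minified markup from the question.
$html = '<p>First Paragraph</p><p>SecondParagraph</p>';

// Greedy: (.*) extends to the LAST </p>, so both paragraphs are swallowed.
preg_match('/<p\b[^>]*>(.*)<\/p>/is', $html, $greedy);

// Lazy: (.*?) stops at the FIRST </p> it can reach.
preg_match('/<p\b[^>]*>(.*?)<\/p>/is', $html, $lazy);

echo $greedy[1], "\n"; // First Paragraph</p><p>SecondParagraph
echo $lazy[1], "\n";   // First Paragraph
```

The lazy version handles the one-line case, but it still cannot cope with nesting: on the nested example it stops at the inner `</p>` and captures the inner opening tag along with the text.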

score 0 · Answer 1 · edited May 23 '17 at 12:18

I refer you to this answer: https://stackoverflow.com/a/1732454/268074

And suggest you use Simple HTML DOM:

http://simplehtmldom.sourceforge.net/

str_get_html($string)->find('p', 0)->plaintext;
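
If a third-party library is not an option, the same idea can be sketched with PHP's built-in DOMDocument. This is only an illustration under stated assumptions: `firstParagraphText` is a hypothetical helper name, and the input is assumed to be an HTML fragment in an encoding the parser handles by default.

```php
<?php
// Hypothetical helper: returns the text of the first <p>, or null if there is none.
function firstParagraphText(string $html): ?string
{
    if (trim($html) === '') {
        return null; // loadHTML() rejects an empty string
    }

    $doc = new DOMDocument();
    // Collect libxml warnings internally instead of emitting them
    // for sloppy, user-written markup.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // item(0) is the first <p> in document order, or null if none exists.
    $first = $doc->getElementsByTagName('p')->item(0);
    return $first === null ? null : trim($first->textContent);
}

echo firstParagraphText('<p>First Paragraph</p><p>SecondParagraph</p>'), "\n"; // First Paragraph
```

Because the markup goes through an HTML parser rather than a pattern, invalid nesting is repaired by the parser instead of confusing the match; how exactly a nested `<p>` is repaired depends on the underlying libxml version, so the nested case is worth checking against real input.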

answered Jan 13 '13 at 08:36 by Petah (edited May 23 '17 at 12:18 by Community)

This is still missing the final . – Nate Lyman Jan 13 '13 at 08:38
@Petah Using a third-party library is always horrible; I'm not sure it's really worth using for just a simple, one-time process. – Omid Jan 13 '13 at 08:45
@OmidAmraei I'm not sure you appreciate the detailed intricacies of HTML/XML. Simple HTML DOM is one of the simpler libraries to use for such a case. If not, then you could use [Query Path](http://querypath.org/) or [DOMDocument](http://php.net/manual/en/class.domdocument.php). But regex is probably not the best solution. – Petah Jan 13 '13 at 09:56
