I want to extract the first paragraph of an article using RegEx and PHP. I started to write a RegEx as below:

'/<p([^>]+)>(.*)<\/p>/i'

It does the job, but there is one bug: when the markup is minified onto a single line, as below:

<p>First Paragraph</p><p>SecondParagraph</p>

It matches the whole string <p>First Paragraph</p><p>SecondParagraph</p> instead of just the first paragraph.
Also, I know that a paragraph cannot be nested inside another one, but I have no control over what users write, so someone may do something like the following, and the RegEx returns an unexpected result in that case:

<p>
    First Paragraph
    <p>SecondParagraph</p>
</p>

Now the RegEx matches <p>First Paragraph<p>SecondParagraph</p> but should extract <p>First Paragraph<p>SecondParagraph</p></p>.
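
To illustrate both cases, here is a minimal sketch that assumes a modified pattern rather than the one above: `[^>]*` replaces `[^>]+` so that a bare `<p>` matches, `.*?` makes the body lazy, and the `s` flag lets `.` span newlines. The variable names are only for illustration. The lazy quantifier handles the minified case, but the nested case is still cut at the inner `</p>`.

```php
<?php
// Lazy (.*?) stops at the first closing </p>; the s flag lets . match newlines.
$pattern = '/<p\b[^>]*>(.*?)<\/p>/is';

// Minified markup: only the first paragraph is matched.
$minified = '<p>First Paragraph</p><p>SecondParagraph</p>';
preg_match($pattern, $minified, $m);
echo $m[0], "\n"; // <p>First Paragraph</p>

// Nested markup: the match still ends at the inner </p>,
// so the outer closing tag is not included.
$nested = "<p>\n    First Paragraph\n    <p>SecondParagraph</p>\n</p>";
preg_match($pattern, $nested, $n);
echo $n[0], "\n";
```

A regular expression of this kind cannot count nesting depth, which is why the second case keeps failing.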

Omid

1 Answer

I refer you to this answer: https://stackoverflow.com/a/1732454/268074

I suggest you use Simple HTML DOM instead:

http://simplehtmldom.sourceforge.net/

str_get_html($string)->find('p', 0)->plaintext;
Petah
  • This is still missing the final . – Nate Lyman Jan 13 '13 at 08:38
  • @Petah Using a third-party library is always horrible; I'm not sure it's really worth using for just a simple, one-time process. – Omid Jan 13 '13 at 08:45
  • @OmidAmraei I'm not sure you appreciate the detailed intricacies of HTML/XML. Simple HTML DOM is one of the simpler libraries to use for such a case. If you don't want it, you could use [Query Path](http://querypath.org/) or [DOMDocument](http://php.net/manual/en/class.domdocument.php). But regex is probably not the best solution. – Petah Jan 13 '13 at 09:56
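
For the DOMDocument route mentioned in the last comment, a minimal sketch might look like the following. It needs no third-party code, only PHP's bundled DOM extension. The helper name `firstParagraphText` is hypothetical, and returning trimmed plain text (rather than the paragraph's HTML) is an assumption about what is wanted.

```php
<?php
// Hypothetical helper, not part of any library: returns the text of the
// first <p> element, or null when the markup contains no <p> at all.
function firstParagraphText(string $html): ?string
{
    $doc = new DOMDocument();

    // Collect parser warnings internally instead of emitting them,
    // since user-supplied markup is often malformed.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // item(0) is the first <p> in document order, or null if there is none.
    $first = $doc->getElementsByTagName('p')->item(0);

    return $first === null ? null : trim($first->textContent);
}

echo firstParagraphText('<p>First Paragraph</p><p>SecondParagraph</p>'), "\n";
// prints: First Paragraph
```

How an invalid nested `<p>` is handled depends on the underlying HTML parser, which may repair the markup before the lookup runs, so it is worth checking that case against real input.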