Getting all the paragraphs in a string extract

Question

I am taking a few paragraphs from a database and trying to separate them into an array, using regex and different classes, but nothing works.

I tried to do this:

    public function get_first_para(){
        $doc = new DOMDocument();
        $doc->loadHTML($this->review);
        foreach($doc->getElementsByTagName('p') as $paragraph) {
            echo $paragraph."<br/><br/><br/>";
        }
    }

But I get this:

Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: Unexpected end tag : p in Entity, line: 9 in C:\Inetpub\vhosts\bestcamdirectory.com\httpdocs\sandbox\model\ReviewContentExtractor.php on line 18

Catchable fatal error: Object of class DOMElement could not be converted to string in C:\Inetpub\vhosts\bestcamdirectory.com\httpdocs\sandbox\model\ReviewContentExtractor.php on line 20

Why do I get these messages? Is there an easy way to extract all the paragraphs from a string?

UPDATE:

    public function get_first_para(){
        $pattern="/<p>(.+?)<\/p>/i";
        preg_match_all($pattern,$this->review,$matches,PREG_PATTERN_ORDER);
        return $matches;
    }

I would prefer the second way, but it doesn't work well either.

Do you specifically want DOMDocument? You mention regex at one point. The error seems to be saying the document is not valid. — Ariel, Aug 07 '12 at 06:19
See this as well: http://stackoverflow.com/questions/2702799/php-parsing-invalid-html — Ariel, Aug 07 '12 at 06:20
I prefer to use regex actually..cause I want to conserve all the html that is inside those tags — Dmitry Makovetskiyd, Aug 07 '12 at 06:28

score 4 · Answer 1 · edited Jun 20 '20 at 09:12

DOMDocument::getElementsByTagName returns a DOMNodeList object, which is iterable but is not an array. Inside the foreach, $paragraph is an instance of DOMElement, so using it directly as a string won't work (as the error explains).

What you want is the text content of the DOMElement, which is available through its textContent property (inherited from the DOMNode class):

foreach($doc->getElementsByTagName('p') as $paragraph) {
  echo $paragraph->textContent."<br/><br/><br/>"; // for text only
}

Or if you need the full content of the DOMNode you can use DOMDocument::saveHTML:

foreach($doc->getElementsByTagName('p') as $paragraph) {
    echo $doc->saveHTML($paragraph)."<br/><br/><br/>\n"; // with the <p> tag

    // without the <p>
    // if you don't need the containing <p> tag, you can iterate through its child nodes and output them
    foreach ($paragraph->childNodes as $cnode) {
         echo $doc->saveHTML($cnode); 
    }
}
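
Since the question asks for the paragraphs as an array rather than echoed output, the same idea can be wrapped in a function. This is only a minimal sketch: the function name paragraphs_inner_html and the sample input are made up for illustration, the type declarations assume PHP 7 or later, and passing a node to saveHTML needs PHP 5.3.6 or later.

```php
<?php
/**
 * Hypothetical helper (not from the original post): returns the inner HTML
 * of every <p> element in $html as an array of strings.
 */
function paragraphs_inner_html(string $html): array
{
    // Collect parse errors internally so sloppy markup does not emit warnings.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous); // restore the earlier setting

    $result = [];
    foreach ($doc->getElementsByTagName('p') as $paragraph) {
        $inner = '';
        foreach ($paragraph->childNodes as $child) {
            $inner .= $doc->saveHTML($child); // serialize each child, tags included
        }
        $result[] = $inner;
    }
    return $result;
}

// Sample input, assumed for illustration only.
$paragraphs = paragraphs_inner_html('<p>First <b>bold</b> text</p><p>Second</p>');
print_r($paragraphs);
```

Each array entry keeps the markup inside the paragraph (the <b> tag here) but not the surrounding <p> tag.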

As for your loadHTML warning: the HTML input is invalid (it contains a stray closing </p> tag). You can suppress such warnings with:

libxml_use_internal_errors(true); // before loading the html content

If you need these errors, see the libxml error handling section of the PHP manual.
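
If you want to inspect the errors instead of just hiding them, something like the following should work. It is a rough sketch: the sample markup is invented, and exactly which errors libxml reports for it can differ between libxml versions.

```php
<?php
// Sample input with a stray closing tag (assumed for illustration only).
$html = '<div><p>First paragraph</p></p><p>Second</p></div>';

libxml_use_internal_errors(true); // buffer parse errors instead of emitting warnings
$doc = new DOMDocument();
$doc->loadHTML($html);

// libxml_get_errors() returns an array of LibXMLError objects.
$errors = libxml_get_errors();
foreach ($errors as $error) {
    printf("line %d: %s\n", $error->line, trim($error->message));
}
libxml_clear_errors(); // empty the buffer so later parses start clean
```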

Edit

Since you insist on regexps, here's how you could go about it:

preg_match_all('!<p>(.+?)</p>!sim',$html,$matches,PREG_PATTERN_ORDER);

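The pattern modifiers: s lets . match line breaks, and i makes the match case-insensitive. m (multiline) only changes how ^ and $ behave, so it has no effect on this pattern. Note that <p> matches only a bare opening tag, not one with attributes such as <p class="...">.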
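
A slightly more tolerant variant might look like this. It is a sketch under the assumption that paragraphs are never nested and are always closed; the sample input is invented. The opening tag is matched with <p\b[^>]*> so that attributes are accepted and tags such as <pre> are not.

```php
<?php
// Sample input (assumed for illustration); note the attribute and the line break.
$html = "<p class=\"intro\">First <b>bold</b>\ntext</p>\n<P>Second</P>";

// s lets . span line breaks, i ignores case.
preg_match_all('!<p\b[^>]*>(.*?)</p>!si', $html, $matches, PREG_PATTERN_ORDER);

// With PREG_PATTERN_ORDER, $matches[0] holds the full matches and
// $matches[1] holds the first capture group (the inner HTML).
$paragraphs = $matches[1];
print_r($paragraphs);
```

This keeps the inner HTML intact, but it will still give wrong results on unclosed or nested <p> tags, which is where the DOM approach is more reliable.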

answered Aug 07 '12 at 06:22

complex857

That isn't good. What it does is convert everything to a string and throw errors along the way... I think I prefer regex – Dmitry Makovetskiyd Aug 07 '12 at 06:27
You can still get the errors with [libxml_get_errors](http://php.net/manual/en/function.libxml-get-errors.php). Also see the [html tidy](http://tidy.sourceforge.net/) project for cleaning up arbitrary HTML input; it may prove useful. – complex857 Aug 07 '12 at 06:29
See my update. I need to preserve the HTML elements, so textContent is no good. It doesn't scrape well. I think regex would be a better solution – Dmitry Makovetskiyd Aug 07 '12 at 06:36
I've added an example which exports the DOMNode's HTML instead of its text content; I think this is what you wanted. Parsing HTML with regexps is [generally a bad idea](http://stackoverflow.com/a/1732454/1515540). – complex857 Aug 07 '12 at 06:46
@complex857 I was waiting for a link to that :) – Ariel Aug 07 '12 at 06:47
One must love the classics :-P – complex857 Aug 07 '12 at 06:48
hmm..thanks for your answer.. I prefer to use regex..only need all the paragraphs – Dmitry Makovetskiyd Aug 07 '12 at 06:59
I've added a regexp version that should work in most sane cases, but I still think it's a bad idea. – complex857 Aug 07 '12 at 14:44
