We use a CMS on our site. Many users have added HTML content into the database that is formatted weirdly. For example, putting all their HTML on a single line:

<h1>This is my title</h1><p>First paragraph</p><p>Second paragraph</p>

This renders in the browser correctly, of course. However, I am writing a script in PHP that loads up this data into a DOMDocument like so:

$doc = new DOMDocument();
$doc->loadHTML($row['body_html']);
var_dump($doc->documentElement->textContent);

This shows up as:

This is my titleFirst paragraphSecond paragraph

How can I get documentElement to return innerText, rather than textContent? I believe innerText will return a string with line breaks.

Lincoln Bergeson
    You should iterate over all elements in the DOMDocument, get the text item by item, and insert the whitespace manually. Have a look [here](http://stackoverflow.com/questions/191923/how-do-i-iterate-through-dom-elements-in-php) for example. DOMDocument itself cannot know where it should put the whitespace. – cb0 Mar 02 '17 at 21:14

1 Answer

As cb0 said:

You should iterate over all elements in the DOMDocument, get the text item by item, and insert the whitespace manually. Have a look here for example. DOMDocument itself cannot know where it should put the whitespace.
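One way to sketch the iteration cb0 describes is to select every text node with DOMXPath and join the pieces with a space. The helper name text_nodes_joined is made up for this illustration, and trimming each piece is an assumption about the desired output, not something the comment specifies.

```php
<?php
// Hypothetical helper: collect every text node and join the pieces with a space.
function text_nodes_joined(string $html): string {
    $doc = new DOMDocument();
    // Suppress parser warnings, since user-entered markup is often imperfect.
    @$doc->loadHTML($html);
    $xpath = new DOMXPath($doc);
    $parts = [];
    // '//text()' matches every text node in the document, in document order.
    foreach ($xpath->query('//text()') as $textNode) {
        $value = trim($textNode->nodeValue);
        if ($value !== '') {
            $parts[] = $value;
        }
    }
    return implode(' ', $parts);
}

echo text_nodes_joined('<h1>This is my title</h1><p>First paragraph</p><p>Second paragraph</p>');
// prints: This is my title First paragraph Second paragraph
```

Note that this also puts a space between inline neighbours, so foo<b>bar</b> comes out as "foo bar".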

I wrote the following function to recursively traverse the DOMDocument object:

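// Recursively collect the text of every text node under $node,
// appending a space after each one so adjacent elements don't run together.
// $text defaults to '' so the function can be called with just a node.
function get_text_from_dom($node, $text = '') {
  if ($node instanceof DOMText) {
    return $text . $node->textContent . ' ';
  }
  // childNodes can be null for childless node types on older PHP versions.
  foreach ($node->childNodes ?? [] as $child) {
    $text = get_text_from_dom($child, $text);
  }
  return $text;
}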

And replaced the code in the question with the following:

$doc = new DOMDocument();
$doc->loadHTML($row['body_html']);
var_dump(get_text_from_dom($doc->documentElement, ''));

It is glorious.
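Since the question asked for line breaks rather than spaces, here is a rough variant that appends a newline after block-level elements instead. It is only an approximation of innerText: the function name get_text_with_breaks and the BLOCK_TAGS list are assumptions made for this sketch, and the list is not complete.

```php
<?php
// Assumption: an illustrative, incomplete list of block-level tag names.
const BLOCK_TAGS = ['h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'p', 'div', 'li', 'tr', 'blockquote', 'pre'];

// Hypothetical helper: gather text recursively, ending each block element with "\n".
function get_text_with_breaks(DOMNode $node): string {
    if ($node instanceof DOMText) {
        return $node->nodeValue;
    }
    $text = '';
    // childNodes may be null for childless node types on older PHP versions.
    foreach ($node->childNodes ?? [] as $child) {
        $text .= get_text_with_breaks($child);
    }
    if ($node instanceof DOMElement) {
        $tag = strtolower($node->nodeName);
        if ($tag === 'br' || in_array($tag, BLOCK_TAGS, true)) {
            $text .= "\n";
        }
    }
    return $text;
}

$doc = new DOMDocument();
$doc->loadHTML('<h1>This is my title</h1><p>First paragraph</p><p>Second paragraph</p>');
echo get_text_with_breaks($doc->documentElement);
```

For the sample markup this prints the title and the two paragraphs on separate lines. Inline elements such as b or span add no whitespace, so text inside a paragraph stays on one line.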

Lincoln Bergeson
  • Note that if there's a – zed May 15 '22 at 19:30