Preserve Line Breaks - Simple HTML DOM Parser

Question

When using PHP Simple HTML DOM Parser, is it normal that line breaks
tags are stripped out?

Use the built-in DOM parser, not Simple HTML DOM. The built-in parser is an order of magnitude faster. http://whitlock.ath.cx/FastCrawl/benchmark.php – Byron Whitlock, Jan 27 '11 at 04:29
Excuse me, @ByronWhitlock, but I do not use Simple HTML DOM Parser for speed. I use it to do tons of things I simply cannot do with DOMDocument, and it's so much easier! But, oh, what I'd do for a PHP extension version of Simple HTML DOM Parser! – Theodore R. Smith, Jul 06 '12 at 18:02

Steve · Answer 1 · 2015-04-23T04:22:38.130

60

I know this is old, but I was looking for this as well, and realized there is actually a built-in option to turn off the removal of line breaks. No need to go editing the source.

The PHP Simple HTML Dom Parser's load function supports multiple useful parameters:

load($str, $lowercase=true, $stripRN=true, $defaultBRText=DEFAULT_BR_TEXT)

When calling the load function, simply pass false as the third parameter.

$html = new simple_html_dom();
$html->load("<html><head></head><body>stuff</body></html>", true, false);

If using file_get_html, it's the ninth parameter.

file_get_html($url, $use_include_path = false, $context=null, $offset = -1, $maxLen=-1, $lowercase = true, $forceTagsClosed=true, $target_charset = DEFAULT_TARGET_CHARSET, $stripRN=true, $defaultBRText=DEFAULT_BR_TEXT)

Edit: For str_get_html, it's the fifth parameter (Thanks yitwail)

str_get_html($str, $lowercase=true, $forceTagsClosed=true, $target_charset = DEFAULT_TARGET_CHARSET, $stripRN=true, $defaultBRText=DEFAULT_BR_TEXT, $defaultSpanText=DEFAULT_SPAN_TEXT)
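For anyone who wants to see the str_get_html call spelled out, here is a minimal sketch. It assumes simple_html_dom.php has already been loaded (the include path in the comment is a guess), and parse_keeping_newlines is a made-up wrapper name, not something the library provides. Arguments two to four just repeat the defaults from the signature above so that the fifth one, $stripRN, can be set to false.

```php
<?php
// Sketch only. Load the library first, for example:
//   require_once 'simple_html_dom.php';   // path is an assumption
// parse_keeping_newlines() is a hypothetical wrapper, not part of the library.

function parse_keeping_newlines(string $markup)
{
    if (!function_exists('str_get_html')) {
        throw new RuntimeException('simple_html_dom.php has not been loaded');
    }
    // Arguments 2 to 4 repeat the defaults from the signature above;
    // the fifth, $stripRN, is the only value changed.
    return str_get_html($markup, true, true, DEFAULT_TARGET_CHARSET, false);
}

// Example input with a single-line JavaScript comment, which breaks
// if the newline after it is removed.
$markup = "<script>\n// say hi\nalert('hi');\n</script>";
```

With the library loaded, something like `$dom = parse_keeping_newlines($markup); echo $dom->save();` should print the script block with its line breaks intact, going by the behaviour described in this answer.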

edited Apr 23 '15 at 04:22

answered Feb 22 '12 at 02:44

Steve

Thank you.. Very helpful. Is it just me or are these parameters undocumented because for the life of me I could not find any official word on how to do this until I stumbled across this? – userabuser May 04 '12 at 00:20
Glad I could help. I didn't find any documentation on it either. I was actually going to mod the library to add this functionality myself when I stumbled on this. – Steve May 19 '12 at 01:39
@userabuser Completely undocumented ;-/ – Theodore R. Smith Jul 06 '12 at 18:05
it's the 5th parameter for `str_get_html` :) I wonder why this isn't the default; stripping line breaks is very bad for javascript single line comments – yitwail Dec 04 '13 at 04:41

score 21 · Answer 2 · edited Sep 30 '11 at 12:31

Was struggling with this as well, since I needed the HTML to be easily editable after processing.

Apparently there's a boolean in the SimpleHTMLDOM script, $stripRN, that's set to true by default. It strips the \r, \n or \r\n characters from the HTML.

Set the var to false (there are several occurrences in the script) and your problem is solved.

edited Sep 30 '11 at 12:31

antyrat

answered Sep 29 '11 at 13:49

tomhermans

I **really** wish this was documented on their website. Cheers, mate! – Theodore R. Smith Jul 06 '12 at 18:02
Check out Hiteklife's answer; it appears to be built-in (yet undocumented) functionality. – Niki Romagnoli Apr 07 '16 at 14:54

score 2 · Answer 3 · answered Nov 15 '11 at 23:27

You don't have to change every $stripRN to false; the only one that affects this behavior is at line 816:

// load html from string
function load($str, $lowercase=true, $stripRN=false, $defaultBRText=DEFAULT_BR_TEXT) {

Also consider changing line 988, because the multibyte functions are often not installed on machines that do not deal with non-Western-European languages. The original line in v1.5 breaks the script immediately when they are missing, so guard the call:

if (function_exists('mb_detect_encoding')) { $charset = mb_detect_encoding($this->root->plaintext . "ascii", $encoding_list = array( "UTF-8", "CP1252" ) ); } else $charset = false;

Adam · Answer 4 · 2014-03-11T14:29:49.370

If you were passing by here wondering whether you can do the same thing in DOMDocument, then I'm pleased to say you can! But it's a bit dirty :(

I had a snippet of code I wanted to tidy but retain the exact line breaks it contained (\n). This is what I did....

// NOTE: If your HTML isn't a full HTML document then expect DOMDocument to
// start creating its own DOCTYPE, head and body tags.


// Convert \n into a pretend tag
$myContent = preg_replace("/[\n]/","<img src=\"slashN\" />",$myContent);

// Do your DOM stuff...
$dom = new DOMDocument;
$dom->loadHTML($myContent);
$dom->formatOutput = true;

$myContent = $dom->saveHTML();

// Remove the \n's that DOMDocument put in itself
$myContent = preg_replace("/[\n]/","",$myContent);

// Put my own \n's back. saveHTML() writes void elements without the
// trailing slash (<img src="slashN">), so match both forms.
$myContent = preg_replace("/<img src=\"slashN\"\s*\/?>/i","\n",$myContent);

It's important to note that I know, without a shadow of a doubt, that my input contained only \n. You may want your own variations if \r\n or \t need to be accounted for, e.g. slash.T or slash.RN.
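If \r\n and \t do need to be accounted for, the same trick can be extended along these lines. This is only a sketch: the function names and the placeholder names (slashRN, slashN, slashT) are made up for illustration, and it assumes the placeholders land somewhere an img tag is allowed, otherwise the parser may move them. A lone \r is not protected and will be dropped.

```php
<?php
// Hypothetical helpers; the placeholder names (slashRN, slashN, slashT)
// are made up for this sketch.

function protect_whitespace(string $html): string
{
    // str_replace works through the array in order, so "\r\n" is
    // handled before a lone "\n".
    return str_replace(
        ["\r\n", "\n", "\t"],
        ['<img src="slashRN">', '<img src="slashN">', '<img src="slashT">'],
        $html
    );
}

function restore_whitespace(string $html): string
{
    // Drop line breaks added during serialization (and any lone "\r").
    $html = str_replace(["\r", "\n"], '', $html);
    $map = ['slashRN' => "\r\n", 'slashN' => "\n", 'slashT' => "\t"];
    // Accept both the <img ...> and <img ... /> forms.
    return preg_replace_callback(
        '/<img src="(slashRN|slashN|slashT)"\s*\/?>/',
        function (array $m) use ($map) {
            return $map[$m[1]];
        },
        $html
    );
}

function tidy_preserving_whitespace(string $html): string
{
    // Same flow as the snippet above: protect, parse, serialize, restore.
    $dom = new DOMDocument();
    $dom->loadHTML(protect_whitespace($html));
    return restore_whitespace($dom->saveHTML());
}
```

Calling tidy_preserving_whitespace($myContent) would then replace the separate steps above. Keeping protect and restore as separate functions makes it easy to check that they round-trip a string before involving DOMDocument at all.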

score -2 · Answer 5 · answered Mar 25 '12 at 11:33

Another option, should one wish to preserve other formatting such as paragraphs and headings, is to use innertext rather than plaintext and then perform your own string cleaning on the result.

I realise there is a performance hit but it does allow for more granular control.
