When using PHP Simple HTML DOM Parser, is it normal that line breaks are stripped out?

- Use the built in dom parser, not simple html dom. The built in parser is an order of magnitude faster. http://whitlock.ath.cx/FastCrawl/benchmark.php – Byron Whitlock Jan 27 '11 at 04:29
- Excuse me, @ByronWhitlock, but I do not use Simple HTML DOM Parser for speed, I use it to do tons of things I simply cannot do with the DOMDocument, and it's so much easier! But, oh, what I'd do for a PHP Extension version of Simple HTML DOM Parser! – Theodore R. Smith Jul 06 '12 at 18:02
5 Answers
I know this is old, but I was looking for this as well, and realized there was actually a built in option to turn off the removal of line breaks. No need to go editing the source.
The PHP Simple HTML DOM Parser's `load` function supports multiple useful parameters:
load($str, $lowercase=true, $stripRN=true, $defaultBRText=DEFAULT_BR_TEXT)
When calling the `load` function, simply pass `false` as the third parameter.
$html = new simple_html_dom();
$html->load("<html><head></head><body>stuff</body></html>", true, false);
If using `file_get_html`, it's the ninth parameter.
file_get_html($url, $use_include_path = false, $context=null, $offset = -1, $maxLen=-1, $lowercase = true, $forceTagsClosed=true, $target_charset = DEFAULT_TARGET_CHARSET, $stripRN=true, $defaultBRText=DEFAULT_BR_TEXT)
Edit: For `str_get_html`, it's the fifth parameter (thanks, yitwail).
str_get_html($str, $lowercase=true, $forceTagsClosed=true, $target_charset = DEFAULT_TARGET_CHARSET, $stripRN=true, $defaultBRText=DEFAULT_BR_TEXT, $defaultSpanText=DEFAULT_SPAN_TEXT)
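For what it's worth, here is a minimal sketch of the `str_get_html` variant. It assumes `simple_html_dom.php` sits in the same directory and exposes the signature quoted above; the wrapper name `parseKeepingLineBreaks` is made up for illustration.

```php
<?php
// Hypothetical wrapper, not part of the library.
// Assumption: simple_html_dom.php lives next to this script and defines
// str_get_html() and DEFAULT_TARGET_CHARSET as quoted in this answer.
function parseKeepingLineBreaks(string $markup)
{
    // Loaded lazily so this file can be included even before the library is in place.
    require_once __DIR__ . '/simple_html_dom.php';

    // Arguments 2 to 4 repeat the defaults from the signature above;
    // the fifth argument is $stripRN, set to false to keep \r and \n.
    return str_get_html($markup, true, true, DEFAULT_TARGET_CHARSET, false);
}
```

Calling `parseKeepingLineBreaks($markup)->save()` should then hand back markup with its original line breaks intact.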

- Thank you.. Very helpful. Is it just me or are these parameters undocumented because for the life of me I could not find any official word on how to do this until I stumbled across this? – userabuser May 04 '12 at 00:20
- Glad I could help. I didn't find any documentation on it either. I was actually going to mod the library to add this functionality myself when I stumbled on this. – Steve May 19 '12 at 01:39
- it's the 5th parameter for `str_get_html` :) I wonder why this isn't the default; stripping line breaks is very bad for javascript single line comments – yitwail Dec 04 '13 at 04:41
I was struggling with this as well, since I needed the HTML to be easily editable after processing.
Apparently there's a boolean in the SimpleHTMLDOM script, `$stripRN`, that's set to `true` by default. It strips the `\r`, `\n` or `\r\n` line breaks from the HTML.
Set the variable to `false` (there are several occurrences in the script) and your problem is solved.

- I **really** wish this was documented on their website. Cheers, mate! – Theodore R. Smith Jul 06 '12 at 18:02
- Check out Hiteklife's answer; it appears to be built-in (yet undocumented) functionality. – Niki Romagnoli Apr 07 '16 at 14:54
You don't have to change every `$stripRN` to `false`; the only one that affects this behavior is at line 816:
// load html from string
function load($str, $lowercase=true, $stripRN=false, $defaultBRText=DEFAULT_BR_TEXT) {
Also consider changing line 988, because the multibyte (mbstring) functions are often not installed on machines that do not deal with non-Western-European languages. Without them, the original line in v1.5 breaks the script immediately. A guarded replacement:
if (function_exists('mb_detect_encoding')) { $charset = mb_detect_encoding($this->root->plaintext . "ascii", $encoding_list = array( "UTF-8", "CP1252" ) ); } else { $charset = false; }

If you were passing by here wondering whether you can do the same thing in DOMDocument, then I'm pleased to say you can, but it's a bit dirty :(
I had a snippet of code I wanted to tidy but retain the exact line breaks it contained (\n). This is what I did:
// NOTE: If your HTML isn't a full HTML document, expect DOMDocument to
// start creating its own DOCTYPE, head and body tags.
// Convert \n into a pretend tag
$myContent = preg_replace("/[\n]/","<img src=\"slashN\" />",$myContent);
// Do your DOM stuff...
$dom = new DOMDocument;
$dom->loadHTML($myContent);
$dom->formatOutput = true;
$myContent = $dom->saveHTML();
// Remove the \n's that DOMDocument put in itself
$myContent = preg_replace("/[\n]/","",$myContent);
// Put my own \n's back (saveHTML() may write the tag without the trailing slash)
$myContent = preg_replace("/<img src=\"slashN\" ?\/?>/i","\n",$myContent);
It's important to note that I knew, without a shadow of a doubt, that my input contained only \n. You may want your own variations if \r\n or \t needs to be accounted for, e.g. slashT or slashRN.
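A rough sketch of that variation, covering `\r\n` and `\t` as well. The helper name `tidyKeepingWhitespace` and the placeholder names are hypothetical, and it assumes the whitespace you care about sits in text content rather than inside tags or attributes.

```php
<?php
// Hypothetical helper: swaps whitespace for placeholder tags, lets
// DOMDocument process the markup, then swaps the whitespace back.
function tidyKeepingWhitespace(string $html): string
{
    // Placeholder names are illustrative only.
    $placeholders = [
        "\r\n" => 'slashRN',
        "\n"   => 'slashN',
        "\t"   => 'slashT',
    ];

    // Build a map of whitespace => pretend tag.
    $toTag = [];
    foreach ($placeholders as $char => $name) {
        $toTag[$char] = '<img src="' . $name . '">';
    }
    // strtr() tries the longest key first, so "\r\n" wins over a bare "\n".
    $html = strtr($html, $toTag);

    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    $out = $dom->saveHTML();

    // Remove the \n's that DOMDocument put in itself.
    $out = str_replace("\n", '', $out);

    // Put the original whitespace back; the pattern accepts both
    // <img src="..."> and <img src="..." />.
    foreach ($placeholders as $char => $name) {
        $pattern = '/<img src="' . preg_quote($name, '/') . '" ?\/?>/i';
        $out = preg_replace($pattern, $char, $out);
    }

    return $out;
}
```

As with the original snippet, expect DOMDocument to wrap a fragment in its own DOCTYPE, html and body tags.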

Another option, should one wish to preserve other formatting such as paragraphs and headings, is to use `innertext` rather than `plaintext`, then perform your own string cleaning on the result.
I realise there is a performance hit, but it does allow for more granular control.
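One way the "own string cleaning" step might look, as a sketch only. The function name `cleanInnerText` and the list of tags to keep are my assumptions; the input would be whatever `$element->innertext` returns.

```php
<?php
// Hypothetical cleaner for a string taken from $element->innertext.
// Keeps paragraph, heading and <br> tags, drops every other tag, and
// collapses runs of spaces and tabs while leaving line breaks alone.
function cleanInnerText(string $innertext): string
{
    // Tags to keep are an assumption; adjust to taste.
    $allowed = '<p><h1><h2><h3><h4><h5><h6><br>';
    $kept = strip_tags($innertext, $allowed);

    // Collapse horizontal whitespace only, so \n and \r survive.
    $kept = preg_replace('/[ \t]+/', ' ', $kept);

    return trim($kept);
}
```

Typical use would be along the lines of `cleanInnerText($html->find('body', 0)->innertext)`, with the selector depending on your document.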
