I would like to remove whitespace and newlines from a string that contains HTML. For example, take the following string:

<ul class="list-group sidebar-nav-v1 margin-bottom-40" id="menuHomeUserPrivate">    
<li class="list-group-item active">
    <a id="to_ProfileOverall" class="privateMenuLinkJS"><i class="fa fa-bar-chart-o"></i> Overall</a>
</li>
<li class="list-group-item list-toggle">
    <a data-toggle="collapse" data-parent="#menuHomeUserPrivate" href="#collapse-MoneyManage" ><i class="fa fa-money"></i> Invoice</a>
    <ul id="collapse-MoneyManage" class="collapse">
        <li><a id="to_MoneyManagerFaturamentoInsert" class="privateMenuLinkJS"><i class="fa  fa-level-down"></i> Big Invoice  </a></li>
        <li><a id="to_MoneyManagerFaturamentoGerir" class="privateMenuLinkJS"><i class="fa  fa-cogs"></i> Big big big

 Invoice 2  </a></li>
    </ul>
 </li>
</ul>

This is the desired result:

<ul class="list-group sidebar-nav-v1 margin-bottom-40" id="menuHomeUserPrivate"><li class="list-group-item active"><a id="to_ProfileOverall" class="privateMenuLinkJS"><i class="fa fa-bar-chart-o"></i>Overall</a></li><li class="list-group-item list-toggle"><a data-toggle="collapse" data-parent="#menuHomeUserPrivate" href="#collapse-MoneyManage" ><i class="fa fa-money"></i> Invoice</a><ul id="collapse-MoneyManage" class="collapse"><li><a id="to_MoneyManagerFaturamentoInsert" class="privateMenuLinkJS"><i class="fa  fa-level-down"></i>Big Invoice</a></li><li><a id="to_MoneyManagerFaturamentoGerir" class="privateMenuLinkJS"><i class="fa  fa-cogs"></i>Big big big Invoice 2</a></li></ul></li></ul>

As you can see:

  1. Only one line: no whitespace or newlines between ">" and "<" when there is no text between them.
  2. Any text between ">" and "<" should be trimmed. Example: </i> Big Invoice </a> becomes </i>Big Invoice</a>.
  3. And finally,

    </i> Big big big
    Invoice 2 </a></li>

becomes </i>Big big big Invoice 2</a></li>: trimmed, with no newline in the middle of the text.

So far I have achieved the first step, using the regex (>\s+<), but I don't know how to achieve steps 2 and 3. Is it possible? Any ideas?


Update: After Adam's post, this is the final code:

//Put your html code here. Do not use double quotes " inside it. Instead, use single.

$str =<<<eof

      your dynamic HTML here.

eof;

$re = "/(?:\\s*([<>])\\s*|(\\s)\\s*)/im"; 
$subst = "$1$2";  
$result = preg_replace($re, $subst, $str);

//If you want to use JSON
$arrToJSON = array(
    "dataPHPtoJs"=>"yourData",
    "htmlDyn"=>"$result"    
    );  
$resultJSON= json_encode(array($arrToJSON));

The resulting HTML string is clean, so you can pass it through AJAX or JSON, or use it inside JavaScript, and it will work.

In my case I am using it inside JavaScript code, with no AJAX and no JSON.

var htmlDyn="<?php echo $result; ?>";
//Do what you want to do with. 
$('.someElementClass').append(htmlDyn);
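
A side note on the embedding step, offered as a hedged sketch rather than as part of the code above: the "no double quotes" restriction exists only because `$result` is pasted into a hand-written JavaScript string literal. PHP's `json_encode` can produce that literal itself, with quotes and other special characters escaped. The sample value and the variable name `$jsLiteral` are assumptions made for illustration.

```php
<?php
// Illustrative value; in the code above this would be the minified $result.
$result = '<ul class="list-group"><li>Big "quoted" Invoice</li></ul>';

// json_encode() returns a quoted, escaped string literal that is also valid JavaScript.
// The JSON_HEX_* flags additionally escape <, >, &, ' and " as \uXXXX sequences,
// so the value cannot close a surrounding <script> block early.
$jsLiteral = json_encode(
    $result,
    JSON_HEX_TAG | JSON_HEX_AMP | JSON_HEX_APOS | JSON_HEX_QUOT
);

// Emits a complete JavaScript statement; no extra quotes are added around the literal.
echo 'var htmlDyn = ' . $jsLiteral . ";\n";
```

In a template this would be written as `var htmlDyn = <?php echo $jsLiteral; ?>;`, without surrounding quotes, and the HTML may then contain double quotes freely.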
IgorAlves
  • What language are you using? – Shafizadeh Mar 01 '16 at 20:47
  • [You cannot parse arbitrary HTML with Regex](http://stackoverflow.com/a/1732454/222364). Your code probably *incorrectly* collapses `` to ``. You need to find a proper HTML parser for whatever language you're working with. There is probably already something that does what you want, called an *HTML Minifier*. (And it seems rendering my examples is collapsing the spaces too... There should be 5 spaces in the first one) – Darth Android Mar 01 '16 at 20:50

3 Answers

Here is the solution:

(?:\s*([<>])\s*|(\s)\s*)

Substitution:

\1\2

You can try it here: https://regex101.com/r/dL5gB5/1
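
One detail worth knowing about the pattern above: the `(\s)\s*` branch keeps the first whitespace character of each run, so a run that starts with a newline leaves a newline behind. A possible variant, sketched here under the assumption that every whitespace run inside text should become a single space, uses two passes instead. The function name `minify_html_fragment` is hypothetical.

```php
<?php
// Hypothetical helper name; not part of the original answer.
function minify_html_fragment(string $html): string
{
    // 1. Collapse every run of whitespace (spaces, tabs, newlines) into one space.
    $html = preg_replace('/\s+/', ' ', $html);
    // 2. Remove the remaining whitespace directly before or after "<" and ">".
    $html = preg_replace('/\s*([<>])\s*/', '$1', $html);
    return trim($html);
}

$input = <<<'EOF'
<li><a id="x"><i class="fa fa-cogs"></i> Big big big

 Invoice 2  </a></li>
EOF;

echo minify_html_fragment($input), "\n";
// prints: <li><a id="x"><i class="fa fa-cogs"></i>Big big big Invoice 2</a></li>
```

Like any regex approach, this also touches whitespace inside attribute values, `<pre>` blocks and inline scripts, and it removes the space between adjacent inline elements, so it is only suitable for markup where that is acceptable.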

Adam

How about some XML conversion?
The following snippet is in PHP but could easily be adapted to another language, e.g. Python.

<?php
$string = <<<EOF
<html>
<ul class="list-group sidebar-nav-v1 margin-bottom-40" id="menuHomeUserPrivate">    
<li class="list-group-item active">
    <a id="to_ProfileOverall" class="privateMenuLinkJS"><i class="fa fa-bar-chart-o"></i> Overall</a>
</li>
<li class="list-group-item list-toggle">
    <a data-toggle="collapse" data-parent="#menuHomeUserPrivate" href="#collapse-MoneyManage" ><i class="fa fa-money"></i> Invoice</a>
    <ul id="collapse-MoneyManage" class="collapse">
        <li><a id="to_MoneyManagerFaturamentoInsert" class="privateMenuLinkJS"><i class="fa  fa-level-down"></i> Big Invoice  </a></li>
        <li><a id="to_MoneyManagerFaturamentoGerir" class="privateMenuLinkJS"><i class="fa  fa-cogs"></i> Big big big

 Invoice 2  </a></li>
    </ul>
 </li>
</ul>
</html>
EOF;

$xml = simplexml_load_string($string);

$dom = new DOMDocument('1.0');
$dom->preserveWhiteSpace = false;
$dom->formatOutput = false;
$dom->loadXML($xml->asXML());

echo $dom->saveXML();
/* output:
<html><ul class="list-group sidebar-nav-v1 margin-bottom-40" id="menuHomeUserPrivate"><li class="list-group-item active"><a id="to_ProfileOverall" class="privateMenuLinkJS"><i class="fa fa-bar-chart-o"/> Overall</a></li><li class="list-group-item list-toggle"><a data-toggle="collapse" data-parent="#menuHomeUserPrivate" href="#collapse-MoneyManage"><i class="fa fa-money"/> Invoice</a><ul id="collapse-MoneyManage" class="collapse"><li><a id="to_MoneyManagerFaturamentoInsert" class="privateMenuLinkJS"><i class="fa  fa-level-down"/> Big Invoice  </a></li><li><a id="to_MoneyManagerFaturamentoGerir" class="privateMenuLinkJS"><i class="fa  fa-cogs"/> Big big big

 Invoice 2  </a></li></ul></li></ul></html>
*/
?>

This removes the whitespace-only text nodes between tags (as the output shows, whitespace inside text content is left untouched) and is safer than using regular expressions on HTML tags.
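
If the text inside elements should be trimmed and collapsed as well, one possible extension is to walk the text nodes with `DOMXPath` after parsing. This is a minimal sketch, assuming the input is well-formed XML and that no element relies on significant whitespace; the function name `normalize_text_nodes` is hypothetical.

```php
<?php
// Hypothetical helper name; assumes $xmlFragment is well-formed XML.
function normalize_text_nodes(string $xmlFragment): string
{
    $dom = new DOMDocument('1.0');
    $dom->preserveWhiteSpace = false; // drop whitespace-only text nodes while parsing
    $dom->formatOutput = false;       // do not re-indent on output
    $dom->loadXML($xmlFragment);

    // Collapse internal whitespace runs and trim every remaining text node.
    $xpath = new DOMXPath($dom);
    foreach ($xpath->query('//text()') as $node) {
        $node->nodeValue = trim(preg_replace('/\s+/', ' ', $node->nodeValue));
    }

    // Serialize only the root element, which omits the XML declaration.
    return $dom->saveXML($dom->documentElement);
}

$xmlInput = "<ul>\n  <li><a id=\"a\"><i class=\"fa\"></i> Big big big\n\n Invoice 2  </a></li>\n</ul>";
$out = normalize_text_nodes($xmlInput);
echo $out, "\n";
```

Trimming every text node also removes spaces that separate inline elements, so the result can render differently from the source when such spaces matter.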

Jan

This will trim the whitespace adjacent to tags and remove newlines in the middle of content.

Find:

(?:\s*(<(?:(?:/?[\w:]+\s*/?)|(?:[\w:]+\s+(?:(?:(?:"[\S\s]*?")|(?:'[\S\s]*?'))|(?:[^>]*?))+\s*/?)|\?[\S\s]*?\?|(?:!(?:(?:DOCTYPE[\S\s]*?)|(?:\[CDATA\[[\S\s]*?\]\])|(?:--[\S\s]*?--)|(?:ATTLIST[\S\s]*?)|(?:ENTITY[\S\s]*?)|(?:ELEMENT[\S\s]*?))))>)\s*|(?:\r?\n)+)  

Replace:

$1   

Output:

<ul class="list-group sidebar-nav-v1 margin-bottom-40" id="menuHomeUserPrivate"><li class="list-group-item active"><a id="to_ProfileOverall" class="privateMenuLinkJS"><i class="fa fa-bar-chart-o"></i>Overall</a></li><li class="list-group-item list-toggle"><a data-toggle="collapse" data-parent="#menuHomeUserPrivate" href="#collapse-MoneyManage" ><i class="fa fa-money"></i>Invoice</a><ul id="collapse-MoneyManage" class="collapse"><li><a id="to_MoneyManagerFaturamentoInsert" class="privateMenuLinkJS"><i class="fa  fa-level-down"></i>Big Invoice</a></li><li><a id="to_MoneyManagerFaturamentoGerir" class="privateMenuLinkJS"><i class="fa  fa-cogs"></i>Big big big Invoice 2</a></li></ul></li></ul>

Benchmark:

Regex1:   (?:\s*(<(?:(?:/?[\w:]+\s*/?)|(?:[\w:]+\s+(?:(?:(?:"[\S\s]*?")|(?:'[\S\s]*?'))|(?:[^>]*?))+\s*/?)|\?[\S\s]*?\?|(?:!(?:(?:DOCTYPE[\S\s]*?)|(?:\[CDATA\[[\S\s]*?\]\])|(?:--[\S\s]*?--)|(?:ATTLIST[\S\s]*?)|(?:ENTITY[\S\s]*?)|(?:ELEMENT[\S\s]*?))))>)\s*|(?:\r?\n)+)
Options:  < none >
Completed iterations:   50  /  50     ( x 1000 )
Matches found per iteration:   29
Elapsed Time:    6.75 s,   6749.58 ms,   6749576 µs