I'd like to process some html code in javascript, to eliminate all extra whitespace, and convert tabs and newlines to a single space. Here's the tricky part: some whitespace is meaningful, and some isn't, and it's hard to tell programmatically which is which. Example:

<table>
 <tr>
  <td>hi</td>
 </tr>
</table>

In the above code, all whitespace and newlines can be eliminated, since having a space between a tr and a td tag is effectively meaningless (even though browsers might create a text node in there, it won't change the appearance of the page). On the other hand:

<span>following is a link</span>
 <a href="#">here it is</a>
<span>and this is text after the link</span>

Here, the whitespace between the closing span tag and the opening "a" tag (etc) is meaningful -- without it, there will be no spaces around the link.

Is there any general way to handle this? It would seem to require that the algorithm has some knowledge of html structure and different characteristics of different tags.

(note: in case you are wondering why I'm parsing html in javascript....it is for an experimental client side template builder gizmo -- long story, but please accept that I have a good reason for doing this :) )

rob
  • Why? Why? Why? Why? – Jakub Konecki Mar 26 '11 at 23:51
  • Because because because. Seriously though, it is a long story, but I want to allow people to edit the html in the most user friendly, pretty, indented way possible, while keeping the output both efficient and (most importantly) not screwing up the formatting. As I said, rest assured there is a good reason -- if only good to me. – rob Mar 27 '11 at 00:15
  • possible duplicate of http://stackoverflow.com/questions/1550532/trimming-whitespace-from-html-content? – William Niu Mar 27 '11 at 00:17
  • Obligatory pointer: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Gareth McCaughan Mar 27 '11 at 00:21
  • I see the problems with regex on html. Since this is happening within a browser environment, I am able to work with the DOM as well (that is: walk the DOM tree of the element, look for text nodes that contain only whitespace, check the computed style of the elements around them for being inline or what have you, change the text node to an empty string if appropriate, then get innerHTML). So I'm open to that sort of solution. – rob Mar 27 '11 at 22:13

2 Answers

You could replace all >\s+< by ><. But this is not safe.

Imagine the following: <span>this</span> <span>text</span> would become thistext when printed. Replacing every run of whitespace between tags with a single space should be safe though:

html = html.replace(/>\s+</g,"> <"); 
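
To see the difference between the two replacements, here is a small sketch that runs both on the markup from the question plus the two-span case. The variable names (sample, stripped, collapsed) are made up for illustration.

```javascript
// Sample markup: the table from the question followed by two inline spans.
var sample = "<table>\n <tr>\n  <td>hi</td>\n </tr>\n</table>\n" +
             "<span>this</span>\n <span>text</span>";

// Variant 1: drop whitespace between tags entirely.
// Harmless for the table, but it glues the two spans together.
var stripped = sample.replace(/>\s+</g, "><");

// Variant 2: collapse whitespace between tags to one space.
// Leaves a few redundant spaces in the table, but keeps the spans apart.
var collapsed = sample.replace(/>\s+</g, "> <");

console.log(stripped);
// <table><tr><td>hi</td></tr></table><span>this</span><span>text</span>
console.log(collapsed);
// <table> <tr> <td>hi</td> </tr> </table> <span>this</span> <span>text</span>
```

Note that this pattern only touches whitespace that sits directly between a ">" and a "<"; whitespace inside text content is left alone.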
morja
  • Yeah, the latter is what I've done. I've also considered some special casing, for instance to turn – rob Mar 27 '11 at 20:16

Ok, well I solved it myself so I will put my solution here. I decided to work on the DOM rather than the html as a string, and then I can grab the innerHTML as a last step. The code is slightly bulky, but the idea is:

Walk the DOM tree of the element, saving data for each node into an array (i.e. linear, not a tree). For element nodes, store both a "startelem" and "endelem" in the array, equivalent to start tags and end tags. Also take note of each element's computed "display" property (e.g. inline, block, etc), and put that in both the items in the array. (For all nodes, I also store the depth into the tree, but it doesn't appear that I need to use this).

For text nodes, take note of whether it is a regular text node, all whitespace, or an empty string.

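Walk the array, and for "whitespace" text nodes, look at the previous and next items in the array. If both of them are display:inline (regular text nodes count as inline), leave the node as a single space. Otherwise, change the text node to an empty string. For regular text nodes, also trim a leading or trailing space when the neighbor on that side is not inline.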

After that, doing an innerHTML on the element will not have the extra spaces, and, to the best I can tell, the appearance in the browser of the element will be unchanged.

Here is the code:

  // Remove whitespace under `elem` that does not affect rendering.
  var stripUnneededTextNodes = function (elem) {
    var array = [];
    addNodeAndChildrenToArray(elem, 1, array);
    // Skip the first and last entries: they are elem's own start and end.
    for (var i = 1; i < array.length - 1; i++) {
      if (array[i].type === "whitespace") {
        // Keep a single space only when both neighbors are inline.
        if (array[i - 1].display === "inline" && array[i + 1].display === "inline") {
          array[i].node.nodeValue = ' ';
        } else {
          array[i].node.nodeValue = '';
          array[i].killed = true;
        }
        delete array[i].node;
      } else if (array[i].type === "text") {
        // Trim a leading/trailing space that touches a non-inline neighbor.
        var val = array[i].node.nodeValue;
        if (val.charAt(0) === ' ' && array[i - 1].display !== "inline") {
          array[i].node.nodeValue = val = val.substring(1);
        }
        if (val.charAt(val.length - 1) === ' ' && array[i + 1].display !== "inline") {
          array[i].node.nodeValue = val.substring(0, val.length - 1);
        }
        delete array[i].node;
      }
    }
  };

  // Flatten the subtree under `node` into `array`, in document order.
  // Elements contribute a "startelem" and an "endelem" entry that carry the
  // computed display value; each text node contributes a single entry.
  var addNodeAndChildrenToArray = function (node, depth, array) {
    switch (node.nodeType) {
      case 1: { // ELEMENT_NODE
        var display = document.defaultView.getComputedStyle(node, null).display;
        array.push({type: "startelem", tag: node.tagName, display: display, depth: depth});
        for (var i = 0; i < node.childNodes.length; i++) {
          addNodeAndChildrenToArray(node.childNodes.item(i), depth + 1, array);
        }
        array.push({type: "endelem", tag: node.tagName, display: display, depth: depth});
        break;
      }
      case 3: { // TEXT_NODE
        // Collapse every run of whitespace to one space before classifying.
        var newVal = node.nodeValue.replace(/\s+/g, ' ');
        node.nodeValue = newVal;
        if (newVal === ' ') {
          array.push({type: "whitespace", node: node, depth: depth});
        } else if (newVal === '') {
          array.push({type: "emptytext", depth: depth});
        } else {
          array.push({type: "text", node: node, display: "inline", depth: depth});
        }
        break;
      }
    }
  };
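
The rule that the loop in stripUnneededTextNodes applies to whitespace-only nodes can also be tried without a browser. The following is only a sketch on hand-written entries shaped like the flattened array; resolveWhitespace and the sample arrays are hypothetical names introduced here, not part of the code above, and the display values are typed in by hand rather than read from getComputedStyle.

```javascript
// Hypothetical, DOM-free helper: for each "whitespace" entry, decide whether
// it stays a single space (both neighbors inline) or becomes an empty string.
function resolveWhitespace(entries) {
  return entries.map(function (entry, i) {
    if (entry.type !== "whitespace") {
      return entry;
    }
    var prev = entries[i - 1];
    var next = entries[i + 1];
    var keep = Boolean(prev && next &&
                       prev.display === "inline" && next.display === "inline");
    return { type: "whitespace", value: keep ? " " : "" };
  });
}

// </span> [whitespace] <a>: both neighbors are inline, so the space is kept.
var inlineCase = resolveWhitespace([
  { type: "endelem", tag: "SPAN", display: "inline" },
  { type: "whitespace" },
  { type: "startelem", tag: "A", display: "inline" }
]);

// <tr> [whitespace] <td>: neither neighbor is inline, so the space is dropped.
var tableCase = resolveWhitespace([
  { type: "startelem", tag: "TR", display: "table-row" },
  { type: "whitespace" },
  { type: "startelem", tag: "TD", display: "table-cell" }
]);

console.log(JSON.stringify(inlineCase[1].value)); // " "
console.log(JSON.stringify(tableCase[1].value));  // ""
```

Because the comparison is against the exact string "inline", values such as "inline-block" are treated like block-level neighbors in this sketch, just as in the loop above.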
rob
  • you can control what you want as indents - in your case one space – markmnl Mar 28 '11 at 00:58
  • because i need it in javascript, and the above does exactly what I need. Also I doubt html tidy could know how to do it right because it can't know whether an element is inline or not without processing all css etc. – rob Mar 28 '11 at 02:27