8

We have a special requirement in a project: we have to parse a string of HTML (from an AJAX response) on the client side, using JavaScript only. That's right, no parsing in PHP or Java! I've been going through Stack Overflow this entire week and have not yet found an acceptable solution.

Some more details on the requirements:

  • We can use any library (preferably Dojo and/or jQuery) or go native!

  • We need to parse an entire HTML document that we receive as a string, including the <head> and <body>.

  • We also need to serialise the parsed DOM structures back out to strings at times.

  • Finally, we don't want to append the parsed DOM to the current document. Rather, we'll send it back to the server for permanent storage.

E.g., we need something like:

var dom = HTMLtoDOM('<html><head><title> This is the old title. </title></head></html>');
dom.getElementsByTagName('title')[0].innerHTML = "This is a new Title";

With my research, these are our options:

  1. A TinyMCE parser. Problem? I think we would necessarily have to include the editor. What about parsing HTML where we don't need an editor?

  2. John Resig's parser. It should be our best bet. Unfortunately, the parser crashes when it is given the entire contents of a page!

  3. The jQuery $(htmlString) or the dojo.toDom(htmlString). Both rely on DocumentFragment and hence gobble up <head> and <body>!

EDIT: We want to serialize the HTML so that we can catch certain custom HTML comments via RegExp. We need to give users the opportunity to edit meta tags, title tags, etc., hence the HTML parser.
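A minimal sketch of the comment-catching step on a serialized string. The marker format `<!-- editable:NAME -->` and the function name `findEditableRegions` are hypothetical, since the actual custom comments are not given here; the pattern would need adjusting to the real markers.

```javascript
// Hypothetical marker format: <!-- editable:NAME --> ... <!-- /editable:NAME -->
// The real CMS markers are not specified; adjust the pattern to match them.
function findEditableRegions(html) {
    var pattern = /<!--\s*editable:([\w-]+)\s*-->([\s\S]*?)<!--\s*\/editable:\1\s*-->/g;
    var regions = [];
    var match;
    // exec() with the g flag walks through every match in turn
    while ((match = pattern.exec(html)) !== null) {
        regions.push({ name: match[1], content: match[2] });
    }
    return regions;
}

var sample = '<body><!-- editable:intro --><p>Hello</p><!-- /editable:intro --></body>';
var regions = findEditableRegions(sample);
console.log(regions);
```

This only works on the serialized string, not on DOM nodes, and it does not handle nested regions with the same name.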

Oh, and I feel I will be murdered on Stack Overflow if I even hint at parsing HTML via RegExp!

Gaurav Ramanan
  • Create an IFRAME node and stuff it in there? – Jens Roland Mar 02 '12 at 21:01
  • But.. I don't understand why you want to 'parse' the already-serialized HTML string before sending it to the server. You'll have to re-serialize it back into a string to pass it back to the server anyway, right? – Jens Roland Mar 02 '12 at 21:02
  • @JensRoland We want to catch certain custom HTML comments from RegExp hence serialisation. We want to give users the ability to edit the title tags, meta tags etc hence the DOM parsing! – Gaurav Ramanan Mar 02 '12 at 21:13
  • @DreamFactory: Give them ``'s! Do not give the user a chance to do XSS or other dangerous things. They do not need to edit the document! And you should _never_ display HTML that does not come from a trusted source. Never trust the client! It's dangerous! – jwueller Mar 02 '12 at 21:22
  • can manage all the issues mentioned without ever needing to parse the whole DOM at one time – charlietfl Mar 02 '12 at 21:27
  • @elusive Normally I would agree, but we are making a CMS. How could we not display the customer's code? – Gaurav Ramanan Mar 03 '12 at 13:08
  • If you need this to work for old versions of IE, check: http://stackoverflow.com/questions/9540218/a-javascript-parser-for-dom – joeytwiddle Nov 16 '15 at 01:21

5 Answers

11

You can leverage the current document without appending any nodes to it.

Try something like this:

function toNode(html) {
    var doc = document.createElement('html');
    doc.innerHTML = html;
    return doc;
}

var node = toNode('<html><head><title> This is the old title. </title></head></html>');

console.log(node);

http://jsfiddle.net/6SvqA/3/
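A related sketch, not part of the original answer: it swaps `document.createElement('html')` for `document.implementation.createHTMLDocument`, which gives a complete detached document, and reads it back with `outerHTML`. It assumes a browser that implements that method (old IE versions do not), and the helper names `htmlToDocument` and `documentToHtml` are made up for illustration.

```javascript
// Sketch only: assumes a browser that implements
// document.implementation.createHTMLDocument (old IE does not).
function htmlToDocument(html) {
    // A new, detached document; nothing is appended to the current page.
    var doc = document.implementation.createHTMLDocument('');
    // Setting innerHTML on the root element replaces the default head and body.
    doc.documentElement.innerHTML = html;
    return doc;
}

function documentToHtml(doc) {
    // outerHTML includes the <html> element itself, but not the doctype.
    return doc.documentElement.outerHTML;
}

// Guarded so the snippet does nothing outside a browser.
if (typeof document !== 'undefined') {
    var parsed = htmlToDocument('<html><head><title> This is the old title. </title></head></html>');
    parsed.getElementsByTagName('title')[0].innerHTML = 'This is a new Title';
    console.log(documentToHtml(parsed));
}
```

The string returned by `documentToHtml` can then be run through regular expressions or sent back to the server.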

Dagg Nabbit
  • Now _this_ is elegant. +1! But we still have the problem that parsing the DOM is the wrong approach to the initial issue. That is not the fault of this answer, though. – jwueller Mar 02 '12 at 21:34
  • @elusive it could be for trusted users, like field agents or something, you never know. – Dagg Nabbit Mar 02 '12 at 21:41
  • @elusive the users are very much trusted! – Gaurav Ramanan Mar 03 '12 at 13:09
  • @GGG Thanks a lot for your code! I agree, a very elegant solution! Just one thing: how can I serialize `node` back to a string so I can apply some RegExes to it? – Gaurav Ramanan Mar 03 '12 at 13:15
  • @GGG Also, is it cross browser? I have doubts regarding IE7+ . The console in IE9 breaks at `doc.innerHTML`. Will be indebted if you could have a look... – Gaurav Ramanan Mar 03 '12 at 13:59
  • @DreamFactory I don't have IE on hand to check unfortunately. I made an update to the answer that might work (it still works in other browsers). If it still doesn't work, let me know and I'll take a look at it next time I'm around a windows machine. – Dagg Nabbit Mar 03 '12 at 15:07
  • @GGG I got the serialisation part using $(node).html () :-D :-D – Gaurav Ramanan Mar 04 '12 at 12:03
  • @GGG In IE `doc.innerHTML = html` gives an error. `**Could not set the innerHTML property. Invalid target element for this operation.**` – Gaurav Ramanan Mar 04 '12 at 12:06
  • @DreamFactory now that you mention it, wouldn't `$("...")` deserialize it? – Dagg Nabbit Mar 04 '12 at 12:07
  • Also try outerHTML, and don't tell anyone I told you to do that :p – Dagg Nabbit Mar 04 '12 at 12:08
  • @GGG Wanna try DOCTYPE as well, since it's before . I already tried `$(string)` (point 2). Same with outerHTML? **Could not set the outerHTML property. Invalid target element for this operation.** – Gaurav Ramanan Mar 04 '12 at 12:14
  • That is a pain. You might be better off going with elusive's iframe solution after all (or rethinking your approach to the problem). – Dagg Nabbit Mar 04 '12 at 12:51
  • @GGG @elusive I still don't get how I can use an iframe to parse the HTML, as it gobbles up the `` and ``. Anyway, I think I found the solution here: [link](http://stackoverflow.com/questions/7474710/can-i-load-an-entire-html-document-into-a-document-fragment-in-internet-explorer) Thanks GGG, your answer led me to it! – Gaurav Ramanan Mar 04 '12 at 20:37
1

I would suggest a two-part solution: first read off the tags that jQuery will not parse for you, then pass the remainder into jQuery. If you're looking for a pure-JavaScript way to parse the HTML structure, jQuery is probably your best bet, as it has many built-in functions for manipulating the data. You could even package this as a jQuery plugin, called via $.parser or something of that nature. If you extend jQuery with your own parsing function, you can also return an extended jQuery object with functions for reading specific data elements, even from the header, since you can manually parse the ... information and store it in the same object.
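A rough sketch of the first part, reading off the head and body by hand. The function name `splitDocument` is hypothetical, and the regular expressions are deliberately naive: they assume at most one head and one body, and would be fooled by those tags appearing inside comments or scripts.

```javascript
// Hypothetical helper: splits a full-page HTML string into head and body parts.
function splitDocument(html) {
    // (?:\s[^>]*)? allows attributes without also matching tags like <header>
    var head = /<head(?:\s[^>]*)?>([\s\S]*?)<\/head>/i.exec(html);
    var body = /<body(?:\s[^>]*)?>([\s\S]*?)<\/body>/i.exec(html);
    return {
        head: head ? head[1] : '',
        body: body ? body[1] : ''
    };
}

var parts = splitDocument(
    '<html><head><title>Old title</title></head><body><p>Hi</p></body></html>'
);
console.log(parts.head);
console.log(parts.body);
// On a page with jQuery loaded, parts.body could then be passed to $(parts.body).
```

Keeping the head as a plain string sidesteps the problem of jQuery dropping it, at the cost of having to edit the title and meta tags with string operations.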

Brian
1

If the HTML is well-formed XML (XHTML-style, with every tag closed), you can use jQuery's parseXML:

var dom = $.parseXML(html);

$('title', dom).text("This is a new Title");

Edit:

If you want to get it back into a string you will need to use the xml plugin, but I cannot find its original source so here it is:

/**
 * jQuery xml plugin
 * Converts XML node(s) to string 
 *
 * Copyright (c) 2009 Radim Svoboda
 * Dual licensed under the MIT (MIT-LICENSE.txt)
 * and GPL (GPL-LICENSE.txt) licenses.
 *
 * @author  Radim Svoboda, user Zzzzzz
 * @version 1.0.0
 */


/**
 * Converts XML node(s) to string using web-browser features.
 * Similar to .html() with HTML nodes 
 * This method is READ-ONLY.
 *  
 * @param all set to TRUE (1,"all",etc.) process all elements,
 * otherwise process content of the first matched element 
 *  
 * @return string obtained from XML node(s)  
 */         
jQuery.fn.xml = function (all) {
  // Result to return
  var s = "";

  // Nothing to process?
  if (!this.length) {
    return s;
  }

  // Nodes to convert: all matched nodes, or the contents of the first one
  var nodes = all ? this : jQuery(this[0]).contents();

  // Convert each node to a string
  nodes.each(function () {
    s += typeof XMLSerializer !== "undefined" ?
      // Browsers that provide XMLSerializer
      new XMLSerializer().serializeToString(this) :
      // Older IE, where MSXML nodes expose an xml property
      this.xml;
  });

  return s;
};

d_inevitable
1

I do not know why anybody should need this, but I suggest you simply dump your source into an iframe. The browser can do the parsing for you. You can even run DOM queries on the result.
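One way this could look, as a sketch under a few assumptions: it runs in a browser, the iframe is attached to the page temporarily (hidden) so that it gets a document, and the input is trusted, because `document.write` will run any scripts in the HTML. The helper names `parseInIframe` and `serializeAndDispose` are made up for illustration.

```javascript
// Sketch only: requires a browser. Use only on trusted HTML, since
// document.write executes any scripts contained in the string.
function parseInIframe(html) {
    var iframe = document.createElement('iframe');
    iframe.style.display = 'none';
    // The iframe has to be attached before its document is available.
    document.body.appendChild(iframe);
    var doc = iframe.contentDocument || iframe.contentWindow.document;
    doc.open();
    doc.write(html);
    doc.close();
    return { iframe: iframe, doc: doc };
}

function serializeAndDispose(result) {
    // Read the whole tree back as a string (the doctype is not included),
    // then remove the temporary iframe from the page.
    var html = result.doc.documentElement.outerHTML;
    result.iframe.parentNode.removeChild(result.iframe);
    return html;
}
```

Between the two calls, `result.doc` can be queried and edited like any other document, for example with `result.doc.getElementsByTagName('title')`.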

jwueller
  • Yes, I tried this! But refer to the point where we may need to serialise the DOM back into strings. How do we do that for an iframe? We are making this for a custom CMS where editable regions will be marked via custom HTML comments. – Gaurav Ramanan Mar 02 '12 at 21:10
1

If you want a full parser that doesn't rely on something already in the browser to bootstrap your interpreter, the HTML parser in dom.js is top-notch. Its entire purpose is to parse HTML for use in a JavaScript-hosted DOM, so it has to cater both to the DOM specifications and to the need to parse and use the results in JS, all without assuming any existing tools besides base JS. It even works in node.js, SpiderMonkey's JS shell, or web workers. https://github.com/andreasgal/dom.js

It also has the serialization part, but to use that you'll need to commit to more than just the parser. You can find standalone serializers, though, that work with any DOM-like structure.