9

I am having a lot of trouble learning RegExp and coming up with a good algorithm to do this. I have this string of HTML that I need to parse. Note that when I am parsing it, it is still a string object and not yet HTML on the browser as I need to parse it before it gets there. The HTML looks like this:

<html>
  <head>
    <title>Geoserver GetFeatureInfo output</title>
  </head>
  <style type="text/css">
    table.featureInfo, table.featureInfo td, table.featureInfo th {
        border:1px solid #ddd;
        border-collapse:collapse;
        margin:0;
        padding:0;
        font-size: 90%;
        padding:.2em .1em;
    }
    table.featureInfo th {
        padding:.2em .2em;
        font-weight:bold;
        background:#eee;
    }
    table.featureInfo td{
        background:#fff;
    }
    table.featureInfo tr.odd td{
        background:#eee;
    }
    table.featureInfo caption{
        text-align:left;
        font-size:100%;
        font-weight:bold;
        text-transform:uppercase;
        padding:.2em .2em;
    }
  </style>

  <body>
    <table class="featureInfo2">
    <tr>
        <th class="dataLayer" colspan="5">Tibetan Villages</th>
    </tr>
    <!-- EOF Data Layer -->
    <tr class="dataHeaders">
        <th>ID</th>
        <th>Latitude</th>
        <th>Longitude</th>
        <th>Place Name</th>
        <th>English Translation</th>
    </tr>
    <!-- EOF Data Headers -->
    <!-- Data -->
    <tr>
    <!-- Feature Info Data -->
        <td>3394</td>
        <td>29.1</td>
        <td>93.15</td>
        <td>བསྡམས་གྲོང་ཚོ།</td>
        <td>Dam Drongtso </td>
    </tr>
    <!-- EOF Feature Info Data -->
    <!-- End Data -->
    </table>
    <br/>
  </body>
</html>

and I need to get it like this:

3394,
29.1,
93.15,
བསྡམས་གྲོང་ཚོ།,
Dam Drongtso

Basically an array. Even better would be if each value were matched to its field header and to the table it came from. Those look like this:

Tibetan Villages

ID
Latitude
Longitude
Place Name
English Translation

Finding out JavaScript does not support wonderful mapping was a bummer and I have what I want working already. However it is VERY VERY hard coded and I'm thinking I should probably use RegExp to handle this better. Unfortunately I am having a real tough time :(. Here is my function to parse my string (very ugly IMO):

    function parseHTML(html){

    //Getting the layer name
    alert(html);
    //Lousy attempt at RegExp
    var somestring = html.replace('/m//\<html\>+\<body\>//m/',' ');
    alert(somestring);
    var startPos = html.indexOf('<th class="dataLayer" colspan="5">');
    var length = ('<th class="dataLayer" colspan="5">').length;
    var endPos = html.indexOf('</th></tr><!-- EOF Data Layer -->');
    var dataLayer = html.substring(startPos + length, endPos);

    //Getting the data headers
    startPos = html.indexOf('<tr class="dataHeaders">');
    length = ('<tr class="dataHeaders">').length;
    endPos = html.indexOf('</tr><!-- EOF Data Headers -->');
    var newString = html.substring(startPos + length, endPos);
    newString = newString.replace(/<th>/g, '');
    newString = newString.substring(0, newString.lastIndexOf('</th>'));
    var featureInfoHeaders = new Array();
    featureInfoHeaders = newString.split('</th>');

    //Getting the data
    startPos = html.indexOf('<!-- Data -->');
    length = ('<!-- Data -->').length;
    endPos = html.indexOf('<!-- End Data -->');
    newString = html.substring(startPos + length, endPos);
    newString = newString.substring(0, newString.lastIndexOf('</tr><!-- EOF Feature Info Data -->'));
    var featureInfoData = new Array();
    featureInfoData = newString.split('</tr><!-- EOF Feature Info Data -->');

    for(var s = 0; s < featureInfoData.length; s++){
        startPos = featureInfoData[s].indexOf('<!-- Feature Info Data -->');
        length = ('<!-- Feature Info Data -->').length;
        endPos = featureInfoData[s].lastIndexOf('</td>');
        featureInfoData[s] = featureInfoData[s].substring(startPos + length, endPos);
        featureInfoData[s] = featureInfoData[s].replace(/<td>/g, '');
        featureInfoData[s] = featureInfoData[s].split('</td>');
    }//end for

    alert(featureInfoData);

    //Put all the feature info in one array
    var featureInfo = new Array();
    var len = featureInfoData.length;
    for(var j = 0; j < len; j++){
        featureInfo[j] = new Object();
        featureInfo[j].id = featureInfoData[j][0];
        featureInfo[j].latitude = featureInfoData[j][1];
        featureInfo[j].longitude = featureInfoData[j][2];
        featureInfo[j].placeName = featureInfoData[j][3];
        featureInfo[j].translation = featureInfoData[j][4];
        }//end for 

    //This can be ignored for now...
        var string = redesignHTML(featureInfoHeaders, featureInfo);
        return string;

    }//end parseHTML

So as you can see, if the content in that string ever changes, my code will be horribly broken. I want to avoid that as much as possible and try to write better code. I appreciate all the help and advice you can give me.

evlogii
elshae
  • 1
    If you're the one generating the HTML on the server side, you could generate JSON there as well and pass it in the HTML with the content. You wouldn't have to parse anything. – Robert Koritnik Nov 22 '10 at 16:49
  • 9
    parsing HTML (or XML) with regex is almost never a good idea. – Shawn Chin Nov 22 '10 at 16:51
  • 3
    http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Mark Thomas Nov 22 '10 at 16:56
  • 1
    There is a golden rule on SO: DO NOT PARSE HTML WITH REGULAR EXPRESSIONS – Richard H Nov 22 '10 at 17:01
  • I am using a server that creates this string (which is HTML so that it can be rendered by the browser), but at this stage where I am parsing, the browser has not seen it yet and it is really nothing more than a string... – elshae Nov 22 '10 at 17:18
  • 2
    I repeat: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 If our hearts are pure, we can stomp out Regexp parsing of HTML in our lifetime! Or Tony will come. – Prof. Falken Apr 18 '12 at 17:25

6 Answers

25

Do the following steps:

  1. Create a new documentFragment
  2. Put your HTML string in it
  3. Use selectors to get what you want

Why do all the parsing work - which won't work anyways, since HTML is not parsable via RegExp - when you have the best HTML parser available? (the Browser)
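The three steps above could be sketched roughly like this. It is a minimal sketch, not the answer author's code: it assumes a browser that supports the `<template>` element (whose `content` is a DocumentFragment), and the function names `parseToFragment` and `textOfAll` are made up for illustration.

```javascript
// Steps 1 and 2 (browser-only): a <template>'s .content is a DocumentFragment,
// and assigning innerHTML lets the browser parse the string without rendering it.
function parseToFragment(htmlString) {
  var template = document.createElement('template');
  template.innerHTML = htmlString;
  return template.content;
}

// Step 3: run a selector against the parsed root and collect the trimmed text.
// `root` can be anything that has querySelectorAll (fragment, element, document).
function textOfAll(root, selector) {
  return Array.prototype.map.call(root.querySelectorAll(selector), function (el) {
    return el.textContent.trim();
  });
}

// In a browser, with the HTML string from the question in `html`:
//   var fragment = parseToFragment(html);
//   var layerName = textOfAll(fragment, 'th.dataLayer')[0];
//   var headers = textOfAll(fragment, 'tr.dataHeaders th');
//   var cells = textOfAll(fragment, 'td');
```

The selectors lean on the class names in the question's markup (`dataLayer`, `dataHeaders`), so they only break if those classes change, not if whitespace or comments do.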

Ivo Wetzel
12

You can use jQuery to easily traverse the DOM and create an object with the structure automatically.

var $dom = $('<html>').html(the_html_string_variable_goes_here);
var featureInfo = {};

$('table:has(.dataLayer)', $dom).each(function(){
    var $tbl = $(this);
    var section = $tbl.find('.dataLayer').text();
    var obj = [];
    var $structure = $tbl.find('.dataHeaders');
    var structure = $structure.find('th').map(function(){return $(this).text().toLowerCase();});
    var $datarows= $structure.nextAll('tr');
    $datarows.each(function(i){
        obj[i] = {};
        $(this).find('td').each(function(index,element){
            obj[i][structure[index]] = $(element).text();
        });
    });
    featureInfo[section] = obj;
});

Working Demo

The code works with multiple tables that have different structures, and with multiple data rows inside each table.

The featureInfo will hold the final structure and data, and can be accessed like

alert( featureInfo['Tibetan Villages'][0]['english translation'] ); // keys are lowercased by toLowerCase() above

or

alert( featureInfo['Tibetan Villages'][0].id );
Gabriele Petrioli
  • That is really nice code you sent me there..but what I think is getting everyone is the fact that I am showing HTML. Ideally when I am parsing this string, it is not "HTML" because the browser has not seen it yet. I tried using some DOM methods and the like before and failed. Then I realized, how can I use DOM functions if this HTML has not yet been sent to the browser...am I right or horribly confused? – elshae Nov 22 '10 at 17:42
  • Well, with jQuery you could do a `var dom = $(htmlstring);` and use it as context for the rest of the code by starting it as `$('table:has(.dataLayer)', dom)`. Updating answer.. – Gabriele Petrioli Nov 22 '10 at 17:46
  • Wow, this is really awesome and kind of you. I am still very new to JavaScript and there is a lot for me to learn! I will work through this code and when I get it working with my app, I'll let you know :) – elshae Nov 22 '10 at 18:01
  • @elshae, I did not mention that my code uses the jQuery framework. – Gabriele Petrioli Nov 22 '10 at 18:16
  • 1
    That I knew as I've dived in jQuery a bit as it seems to be the revolution of JavaScript :). Your code works beautifully and I really thank you for showing me what's available out there :) – elshae Nov 22 '10 at 19:38
10

The "correct" way to do it is with DOMParser. Do it like this:

var parsed = new DOMParser().parseFromString(htmlString, 'text/html');

Or, if you're worried about browser compatibility, use the polyfill on the MDN documentation:

/*
 * DOMParser HTML extension
 * 2012-09-04
 * 
 * By Eli Grey, http://eligrey.com
 * Public domain.
 * NO WARRANTY EXPRESSED OR IMPLIED. USE AT YOUR OWN RISK.
 */

/*! @source https://gist.github.com/1129031 */
/*global document, DOMParser*/

(function(DOMParser) {
    "use strict";

    var
      DOMParser_proto = DOMParser.prototype
    , real_parseFromString = DOMParser_proto.parseFromString
    ;

    // Firefox/Opera/IE throw errors on unsupported types
    try {
        // WebKit returns null on unsupported types
        if ((new DOMParser).parseFromString("", "text/html")) {
            // text/html parsing is natively supported
            return;
        }
    } catch (ex) {}

    DOMParser_proto.parseFromString = function(markup, type) {
        if (/^\s*text\/html\s*(?:;|$)/i.test(type)) {
            var
              doc = document.implementation.createHTMLDocument("")
            ;
                if (markup.toLowerCase().indexOf('<!doctype') > -1) {
                    doc.documentElement.innerHTML = markup;
                }
                else {
                    doc.body.innerHTML = markup;
                }
            return doc;
        } else {
            return real_parseFromString.apply(this, arguments);
        }
    };
}(DOMParser));
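Once `parseFromString` has returned a document, the rows can be turned into header-keyed objects with plain DOM calls. The following is only a sketch under assumptions: it expects a single table shaped like the one in the question (one `tr.dataHeaders` row, data in `td` cells), and `toRecord` and `featureRecords` are hypothetical names.

```javascript
// Pair each header with the cell at the same position in one row.
function toRecord(headers, cells) {
  var record = {};
  headers.forEach(function (name, i) {
    record[name] = cells[i];
  });
  return record;
}

// Browser-only: parse the string, then read headers and data rows.
function featureRecords(htmlString) {
  var doc = new DOMParser().parseFromString(htmlString, 'text/html');
  var text = function (el) { return el.textContent.trim(); };
  var headers = Array.prototype.map.call(
    doc.querySelectorAll('tr.dataHeaders th'), text);
  var records = [];
  Array.prototype.forEach.call(doc.querySelectorAll('table tr'), function (tr) {
    var cells = Array.prototype.map.call(tr.querySelectorAll('td'), text);
    // Header rows hold only <th> cells, so they are skipped here.
    if (cells.length > 0) {
      records.push(toRecord(headers, cells));
    }
  });
  return records;
}
```

With several tables that use different headers, the header lookup would have to move inside a per-table loop.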
markasoftware
  • it doesn't work on ie9, SCRIPT600: Invalid target element for this operation. – Mikalai Feb 24 '16 at 08:37
  • @Mikalai Sorry, but I will not work on getting IE9 compatibility. It's used by less than 1% of people, and really is more trouble than it's worth. – markasoftware Feb 25 '16 at 03:33
  • Why have you decided to use HTMLHtmlElement rather than: var iframe = document.createElement("iframe"); iframe.innerHTML = markup; ? – Mikalai Feb 25 '16 at 10:08
  • I did not write the code in the second section of this answer; it's from the MDN documentation. – markasoftware Feb 25 '16 at 23:50
5

Change server-side code if you can (add JSON)

If you're the one that generates the resulting HTML on the server side you could as well generate a JSON there and pass it inside the HTML with the content. You wouldn't have to parse anything on the client side and all data would be immediately available to your client scripts.

You could easily put JSON in the table element as a data attribute value (note the double-quoted keys and strings, which strict JSON requires):

<table class="featureInfo2" data-json='{"ID":3394, "Latitude":29.1, "Longitude":93.15, "PlaceName":"བསྡམས་གྲོང་ཚོ།", "Translation":"Dam Drongtso"}'>
    ...
</table>

Or you could add data attributes to the TDs that contain data, select only those with jQuery selectors, and build a JavaScript object from them. No need for RegExp parsing.
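For a really simple example of both variants, something like the sketch below could work. It assumes the `data-json` attribute holds strict JSON (double-quoted keys and strings), and the per-cell attribute name `data-field` is made up for this sketch, as are the function names; none of them come from the server output in the question.

```javascript
// Reads the JSON blob from a table's data-json attribute.
// Returns null when the attribute is missing or is not strict JSON.
function readFeatureJson(tableElement) {
  var raw = tableElement.getAttribute('data-json');
  if (raw === null) {
    return null;
  }
  try {
    return JSON.parse(raw);
  } catch (e) {
    return null;
  }
}

// Builds one object from cells marked up like <td data-field="Latitude">29.1</td>.
// `data-field` is a hypothetical attribute name.
function readFeatureCells(cells) {
  var record = {};
  Array.prototype.forEach.call(cells, function (td) {
    var name = td.getAttribute('data-field');
    if (name) {
      record[name] = td.textContent.trim();
    }
  });
  return record;
}

// In a browser:
//   var table = document.querySelector('table.featureInfo2');
//   var fromJson = readFeatureJson(table);
//   var fromCells = readFeatureCells(table.querySelectorAll('td[data-field]'));
```

Values read from cells stay strings, while values from the JSON attribute keep their JSON types (numbers stay numbers).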

Robert Koritnik
  • I am the owner of the page, or at least I have access to the full backend. The problem is that I am using a server which generates this HTML string for me, this was not my choice. – elshae Nov 22 '10 at 17:24
  • @elshae: In other words I wanted to ask whether you have access and ability/knowledge to change server side code of the page? If you do, then I suggest you actually send JSON with the page itself. – Robert Koritnik Nov 22 '10 at 17:26
  • add data attributes to TDs Could you give me a really simple example? Does this mean `92.34`??? – elshae Nov 22 '10 at 17:32
  • I'm sorry if my answer was not clear. Theoretically I do have access to send JSON to my browser, but as the server I am using does this part of the work for me, it can be said that this is encapsulated from me. In other words, the amount of effort it would take to go into the server and re-invent the way it sends this data to the browser just doesn't seem worth it to me... – elshae Nov 22 '10 at 17:35
  • @elshae: I think adding an additional element attribute is worth the hassle, because you have data on the server in a structured object way. Generating JSON out of it is much simpler than parsing HTML. What if in some time you get to change the HTML itself? You'll have to re-develop the parser as well. Having JSON in it doesn't change any of the client-side functionality. Check the example I've added. – Robert Koritnik Nov 22 '10 at 18:02
  • I appreciate your feedback and your answer is also great, unfortunately since I am a little pressed for time and it is not a part of my project's requirement I will go with Gaby's answer below. Thank you and I hope others will take your advice into account if they have the time to dive into the code. – elshae Nov 22 '10 at 19:41
2

Use John Resig's* pure JavaScript HTML parser

See demo here

*John Resig is the creator of jQuery

adardesign
0

I had a similar requirement and, not being that experienced with JavaScript, I let jQuery handle it for me with parseHTML and find. In my case I was looking for divs with a particular class name.

function findElementsInHtmlString(document, htmlString, query) {
    var domArray = $.parseHTML(htmlString, document),
        dom = $();

    // create the dom collection from the array
    $.each(domArray, function(i, o) {
        dom = dom.add(o);
    });

    // return a collection of elements that match the query
    return dom.find(query);
}

var elementsWithClassBuild = findElementsInHtmlString(document, htmlString, '.build');
kelceyp