I want to get the HTML for a table definition (the opening `<table ...>` tag) by extracting it from the table's outerHTML, looking for the index of '> whatever <'.

I tried several patterns and match(), but no luck.

<!DOCTYPE html>
<html>
    <head>     
        <meta charset="UTF-8">
        <meta name="viewport" content="width=device-width, initial-scale=1.0">
    </head>
    <body>
        <!-- <thead> not on same line as <table> -->
        <table  id="t1" border="1">         
            <thead>
                <tr>   <th colspan="2">1</th><th colspan="3">22 </th></tr>
                <tr>    <th>1</th><th  data-rotate>22</th><th data-rotate>333</th><th>4444</th><th>5555555</th></tr>
            </thead>
            <tr><td>aaaaaaa</td><td>bbbbbbbbb</td><td>cccccccccc</td><td>ddddd<br>ddddddd</td><td>dddddddddddd</td></tr>
        </table>
        <!-- <thead> on same line as <table> -->
        <table  id="t2" border="1" >  <thead>                  
                <tr>   <th colspan="2">1</th><th colspan="3">22 </th></tr>
                <tr>    <th>1</th><th  data-rotate>22</th><th data-rotate>333</th><th>4444</th><th>5555555</th></tr>
            </thead>
            <tr><td>aaaaaaa</td><td>bbbbbbbbb</td><td>cccccccccc</td><td>ddddd<br>ddddddd</td><td>dddddddddddd</td></tr>
        </table>
        <p>
        <div id="out1"></div>
        <p>
        <div id="out2"></div>
        <script>
            /*****************************************
             * want to get the HTML for a table definition
             * by extracting <table ...> from outer html, looking
             * for the index of '> whatever <' 
             *****************************************/
            var m, t, oh, index;
            /*****************************************
             * does not work
             *****************************************/
            t = document.getElementById('t1');
            oh = t.outerHTML;
            index = oh.search(/\> *</); // what is wrong with  regex
            document.getElementById('out1').innerHTML = htmlentity(oh.substring(0, index + 1));
            /*****************************************
             * works
             *****************************************/
            t = document.getElementById('t2');
            oh = t.outerHTML;
            index = oh.search(/\> *\</);
            document.getElementById('out2').innerHTML = htmlentity(oh.substring(0, index + 1));

            function htmlentity(value) {
                value = value.replace(/&/gi, "&amp;");
                value = value.replace(/</gi, "&lt;");
                value = value.replace(/>/gi, "&gt;");
                value = value.replace(/"/gi, "&quot;");
                value = value.replace(/'/gi, "&#039;");
                return value;
            }
        </script>
    </body>
</html>


The first table definition, 't1', does not work with my regex. The second table definition, 't2', does work with my regex.

The output:

[image: screenshot of the output, not reproduced here]

  • What is this regex even supposed to find? And why even use a regex instead of DOM methods? – VLAZ Jun 19 '19 at 13:04
  • I just want to find the start of the first HTML tag after the closing '>' of the table element. The first tag is supposed to open with '<'; with the index of that one I can extract the HTML source for the table. Quick, dirty, shaky, I know ... –  Jun 19 '19 at 13:37
  • So...something like `table.nextSibling`? – VLAZ Jun 19 '19 at 13:41

2 Answers

what is wrong with regex

Regular expressions are the wrong tool for parsing HTML. (Obligatory link.) They could be part of an HTML parser, but a single expression alone is not up to this task.

want to get the HTML for a table definition

I would take a much more direct approach: The table is already parsed, so just clone it, remove all text nodes from the clone, then (if you need HTML rather than just the node tree) get its outerHTML:

function extractStructure(element) {
    const clone = element.cloneNode(true);
    removeText(clone);
    return clone.outerHTML;
}
function removeText(element) {
    let child = element.firstChild;
    while (child) {
        let next = child.nextSibling;
        if (child.nodeType === 1) { // Element
            removeText(child);
        } else if (child.nodeType === 3) { // Text
            element.removeChild(child);
        }
        child = next;
    }
}

console.log(extractStructure(document.getElementById("t1")));
console.log(extractStructure(document.getElementById("t2")));
<table  id="t1" border="1">         
    <thead>
        <tr>   <th colspan="2">1</th><th colspan="3">22 </th></tr>
        <tr>    <th>1</th><th  data-rotate>22</th><th data-rotate>333</th><th>4444</th><th>5555555</th></tr>
    </thead>
    <tr><td>aaaaaaa</td><td>bbbbbbbbb</td><td>cccccccccc</td><td>ddddd<br>ddddddd</td><td>dddddddddddd</td></tr>
</table>
<!-- <thead> on same line as <table> -->
<table  id="t2" border="1" >  <thead>                  
        <tr>   <th colspan="2">1</th><th colspan="3">22 </th></tr>
        <tr>    <th>1</th><th  data-rotate>22</th><th data-rotate>333</th><th>4444</th><th>5555555</th></tr>
    </thead>
    <tr><td>aaaaaaa</td><td>bbbbbbbbb</td><td>cccccccccc</td><td>ddddd<br>ddddddd</td><td>dddddddddddd</td></tr>
</table>
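
If only the opening tag is needed, as the question describes, a shallow clone may be enough. The following is a minimal sketch, not part of the answer above: `openingTagOf` is a hypothetical helper name, and it assumes the element serializes with a closing tag that matches its `localName` (void elements such as `<br>` are returned unchanged).

```javascript
// Hypothetical helper: returns the serialized opening tag of an element.
function openingTagOf(element) {
    // cloneNode(false) copies the element and its attributes, but no children.
    const shell = element.cloneNode(false);
    const html = shell.outerHTML; // e.g. '<table ...></table>'
    const closing = `</${element.localName}>`;
    // Strip the closing tag if the serialization ends with one.
    return html.endsWith(closing) ? html.slice(0, -closing.length) : html;
}

// In a browser: openingTagOf(document.getElementById("t1"))
```

Because the string is built from the DOM, it does not depend on where line breaks or '>' characters appear in the markup.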
T.J. Crowder

In t1 there is a line break right after the opening table tag:

<table  id="t1" border="1">         
    <thead>

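Your regex allows only spaces between the '>' and the next '<', so in t1 it skips past the table tag and matches further down. Match the first '>' instead.

Try with this: index = oh.search(/>.*?/); (the lazy .*? matches zero characters here, so this finds the same position as />/).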

Code:

    const regexT = />.*?/;
    t = document.getElementById('t1');
    oh = t.outerHTML;
    index = oh.search(regexT);
    document.getElementById('out1').innerHTML = htmlentity(oh.substring(0, index + 1));
    t = document.getElementById('t2');
    oh = t.outerHTML;
    index = oh.search(regexT);
    document.getElementById('out2').innerHTML = htmlentity(oh.substring(0, index + 1));
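
To see why the original pattern behaves differently for the two tables, here is a small sketch that runs on plain strings. The strings `t1` and `t2` are shortened, hypothetical stand-ins for the two `outerHTML` values, and `openingTag` is an illustrative helper, not something from the question.

```javascript
// Shortened stand-ins for t1.outerHTML and t2.outerHTML (assumed shapes).
const t1 = '<table id="t1" border="1">\n    <thead>\n        <tr>   <th>1</th></tr>\n    </thead>\n</table>';
const t2 = '<table id="t2" border="1">  <thead>\n        <tr>   <th>1</th></tr>\n    </thead>\n</table>';

// Cut the string just after the '>' where the pattern first matches.
function openingTag(html, pattern) {
    const index = html.search(pattern);
    return index === -1 ? null : html.substring(0, index + 1);
}

const spacesOnly = /> *</;      // pattern from the question: spaces only
const anyWhitespace = />\s*</;  // \s also matches line breaks and tabs
const firstGt = />/;            // same position as />.*?/

console.log(JSON.stringify(openingTag(t1, spacesOnly)));    // runs on to the first <tr>
console.log(JSON.stringify(openingTag(t1, anyWhitespace))); // "<table id=\"t1\" border=\"1\">"
console.log(JSON.stringify(openingTag(t1, firstGt)));       // "<table id=\"t1\" border=\"1\">"
console.log(JSON.stringify(openingTag(t2, spacesOnly)));    // "<table id=\"t2\" border=\"1\">"
```

In `t1` the first place where `>` is followed only by spaces and then `<` is after `<tr>`, which is why the question's pattern returns too much for that table.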

Side note: pattern matching is probably not the best approach in this case (see T.J. Crowder's answer).

Oussama Ben Ghorbel
  • Why is `>` escaped? It has no special meaning in a regex. – VLAZ Jun 19 '19 at 13:09
  • Thanks, this works for me now. I use outerHTML because this is the SOURCE. I know I am off when there is a '<' as part of a data attribute or style or whatever, but for the moment this is good enough for me :-) –  Jun 19 '19 at 13:22
  • If you solved your problem please don't forget to up vote and accept so people facing the same problem can use it. And don't forget to see T.J.'s comment. – Oussama Ben Ghorbel Jun 19 '19 at 13:34
  • @Heinz - *"I use outerHTML because this is the SOURCE"* No, it isn't the source. It's a new string created by going through the elements and building HTML for those elements from the DOM. That's not the same thing as the original source that created the table (if there even was original source, as opposed to it being created dynamically). – T.J. Crowder Jun 19 '19 at 13:39
  • @T.J. Crowder - I fear I don't fully understand what you are telling me. The point for me is: as long as I find, in the outerHTML, everything I have written between the opening '<' and closing '>' brackets, this is OK for me. –  Jun 19 '19 at 14:04
  • @Heinz - The difference is that what you get from `outerHTML` is not exactly what you wrote in the source. But the bigger issue here is, again, a regular expression solution **will not work reliably**. For instance, suppose you have an attribute that contains `>`, or a text node that contains `>` (both of which are completely valid HTML). The above will fail. A proper approach using the DOM will succeed. – T.J. Crowder Jun 19 '19 at 14:46
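
The failure case from the last comment can be reproduced on a plain string. This is only an illustration: `html` is a hypothetical markup string with a `>` inside a quoted attribute value, not necessarily what `outerHTML` would return in a given browser.

```javascript
// Hypothetical markup with '>' inside an attribute value.
const html = '<table id="t3" data-note="a > b"><tbody></tbody></table>';

// Same technique as above: cut after the first '>'.
const cut = html.substring(0, html.search(/>/) + 1);

console.log(cut); // <table id="t3" data-note="a >
```

The cut lands inside the attribute value, so the extracted "tag" is incomplete.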