
Below I have an HTML element, and I use JavaScript to read the value of its widget attribute. The code alerts <test> instead of &lt;test&gt;, so the browser apparently unescapes attribute values automatically:

alert(document.getElementById("hau").attributes[1].value)
<div id="hau" widget="&lt;test&gt;"></div>

My questions are:

  1. Can this behavior be prevented in any way, besides doing a double escape of the attribute contents? (It would look like this: &amp;lt;test&amp;gt;)
  2. Does anyone know why the browser behaves like this? Is there any place in the HTML specs that this behavior is mentioned explicitly?
theEpsilon
pax162

3 Answers


1) It can be done without doing a double escape

What you are after is essentially an htmlEncode() function. If you don't mind using jQuery:

alert(htmlEncode($('#hau').attr('widget')))

function htmlEncode(value){
  //create an in-memory div, set its inner text (which jQuery automatically encodes)
  //then grab the encoded contents back out.  The div never exists on the page.
  return $('<div/>').text(value).html();
}
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<div id="hau" widget="&lt;test&gt;"></div>

If you're interested in a pure vanilla JS solution:

alert(htmlEncode(document.getElementById("hau").attributes[1].value))
function htmlEncode( html ) {
    return document.createElement( 'a' ).appendChild( 
        document.createTextNode( html ) ).parentNode.innerHTML;
};
<div id="hau" widget="&lt;test&gt;"></div>

2) Why does the browser behave like this?

This behaviour is what makes certain things possible, such as including quotes inside a pre-filled input field, as shown below. Without it, the only way to put a " inside a double-quoted attribute value would be to write the character itself, which would in turn require escaping it with another character such as \

<input type='text' value="&quot;You &apos;should&apos; see the double quotes here&quot;" />
Saravanabalagi Ramachandran

The browser unescapes the attribute value as soon as it parses the document (mentioned here). One of the reasons might be that it would otherwise be impossible to include, for example, double quotes in your attribute value (well, technically it would be possible if you put the value in single quotes instead, but then you wouldn't be able to include single quotes in the value).
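
To make the parse-time decoding concrete, here is a minimal sketch of what happens to an attribute value, restricted to five named references. The name `decodeBasicEntities` is made up for illustration, and a real HTML parser handles far more (every named reference, plus numeric ones such as `&#60;`), so treat this only as an approximation. It also shows why the double escape from the question survives as `&lt;test&gt;`.

```javascript
// Hypothetical helper: decodes only five common named character references.
// A real HTML parser supports many more, including numeric references.
const BASIC_ENTITIES = { amp: '&', lt: '<', gt: '>', quot: '"', apos: "'" };

function decodeBasicEntities(source) {
  // Single pass: each reference is decoded exactly once, left to right.
  return source.replace(/&(amp|lt|gt|quot|apos);/g, (match, name) => BASIC_ENTITIES[name]);
}

console.log(decodeBasicEntities('&lt;test&gt;'));         // <test>
console.log(decodeBasicEntities('&amp;lt;test&amp;gt;')); // &lt;test&gt;
```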

That said, the behavior cannot be prevented. If you really need the value with the HTML entities in it, you can simply convert the special characters back into their entity codes (I recommend Underscore's escape for such a task).
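
If pulling in Underscore is not an option, the re-encoding step can be sketched without dependencies. The name `escapeHtml` and the exact character set below are assumptions chosen for illustration; this is not Underscore's implementation.

```javascript
// Hypothetical helper: maps the five characters that matter most in HTML
// text and attribute values to character references.
const HTML_ESCAPES = { '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;', "'": '&#39;' };

function escapeHtml(text) {
  // One pass over the string, so an inserted "&" is never escaped again.
  return String(text).replace(/[&<>"']/g, (ch) => HTML_ESCAPES[ch]);
}

console.log(escapeHtml('<test>'));  // &lt;test&gt;
console.log(escapeHtml('a & "b"')); // a &amp; &quot;b&quot;
```

In the question's example, the argument would be the value read from the widget attribute.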

lucasnadalutti

This post doesn't address the "why";
it only provides a workaround for converting the HTML entities back.

in_short

If you want to output the (whole document's) outerHTML as the original text, with the HTML entities properly escaped, use the function encode_HtmlEntity_in_TagAttr below.

pre-note

It seems impossible to revert an HTML entity in a tag attribute back to its original escaped form
while you are modifying the actual attribute inside the DOM.

eg:

          const elt_AA = $(/* html */ `<span data-foo="Time &gt; 0">Pick_this_span</span>`)[0];
          const attr = elt_AA.getAttributeNode('data-foo');
          console.log(attr.value);                                   // Time > 0
          console.log(elt_AA.outerHTML);                             // <span data-foo="Time > 0">Pick_this_span</span>
          attr.value = _.escape(attr.value); // elt_AA.setAttribute('data-foo', _.escape(attr.value)) // same 
          console.log(attr.value);                                   // Time &gt; 0
          console.log(elt_AA.outerHTML);                             // <span data-foo="Time &amp;gt; 0">Pick_this_span</span>
          // << watch this
          //   attr.value looks escaped, but check the outerHTML:
          //   this simply double-escapes the attribute, which is still not the original attribute value
          
          // more test
          attr.value = '&gt;';
          console.log(attr.value);                                   // &gt;  
          console.log(elt_AA.outerHTML);                             // <span data-foo="&amp;gt;">Pick_this_span</span>
          attr.value = '>';
          console.log(attr.value);                                   // >
          console.log(elt_AA.outerHTML);                             // <span data-foo=">">Pick_this_span</span>   
          attr.value = '&';
          console.log(attr.value);                                   // &  
          console.log(elt_AA.outerHTML);                             // <span data-foo="&amp;">Pick_this_span</span> 
          // attr.value = '&amp;gt;';
          // console.log(attr.value);
          // console.log(elt_AA.outerHTML);

soln-workaround

  1. As said:

    It seems impossible to revert an HTML entity in a tag attribute back to its original escaped form
    while you are modifying the actual attribute inside the DOM.

    So we don't do that. Instead:

  2. We replace each special character inside the tag attribute with a custom placeholder (our own escaping).

  3. We turn the placeholders "back into the original escaped form" in the serialized outerHTML string, using a regex.

  • note:

    • the whole point is the logic of encode_HtmlEntity_in_TagAttr

    • the logic is as stated above; it's not elegant, it's really just a workaround

    • this encodes/escapes all special characters in tag attributes
      (which is more than restoring the original, if a character in your original HTML was not escaped).

    • you can't run the test code below directly; you need to import and set up some things first.
      The code is tested, but I'm still not sure whether the code I posted has bugs (mostly due to settings in JSDOM)
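
Steps 2 and 3 can also be tried on plain strings, without a DOM. The sketch below uses a fixed token so the output is reproducible; the helper names `toPlaceholders` and `placeholdersToEntities` are made up for this illustration and are not part of the tested code.

```javascript
// Fixed token for a reproducible example; a timestamp would make collisions less likely.
const PREFIX = 'BAUEHTAttr';
const TOKEN = '12345';
const NAMES = { '&': 'amp', '<': 'lt', '>': 'gt', '"': 'quot', "'": 'apos', '`': 'grave' };

// Step 2: replace each special character with an inert placeholder
// that the HTML serializer will leave untouched.
function toPlaceholders(value) {
  return value.replace(/[&<>"'`]/g, (ch) => PREFIX + NAMES[ch] + TOKEN);
}

// Step 3: turn the placeholders in the serialized markup into entities.
function placeholdersToEntities(html) {
  const pattern = new RegExp(PREFIX + '(amp|gt|lt|quot|apos|grave)' + TOKEN, 'g');
  return html.replace(pattern, '&$1;');
}

const masked = toPlaceholders('Time > 0');
console.log(masked);                         // Time BAUEHTAttrgt12345 0
console.log(placeholdersToEntities(masked)); // Time &gt; 0
```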

code (+ test case demo):

        it('~test Tag Attribut contains Html Entity', async function () {

          // #>>> 
          const html_HtaHe_ori = /* html */ `
            <!DOCTYPE html>
            <html lang="en">
              <head>
                <title>Empty_Title</title>
              </head>
              <body>
                <div class="sect3" title="MIN &gt; 0, TIME == 0 (blocking read)"> AAA </div>
                <div class="sect3" title="MIN <>>A> &gt; &quot;TT&quot; &amp; 0'''''sssss "> BBB </div>
              </body>
            </html>
          `;

          // #>>> Browser auto unescaped HtmlEntity
          fs.writeFileSync(pathStr_tmpFile_1, html_HtaHe_ori);

          dom = await JSDOM.fromFile(pathStr_tmpFile_1, {
            contentType: 'text/html; charset="utf-8"',
          });
          const _document = dom.window.document;
          let document = _document;
          const $ = jQuery(dom.window);

          const html_HtaHe_BrowserAutoUnesc = _document.body.outerHTML; // Browser auto unescaped HtmlEntity

          // #>>> Escape HtmlEntity back
          function encode_HtmlEntity_in_TagAttr() {
            // escape // escapeCustomly_for_allUnEscapedHtmlEntity_in_HtmlTagAttr
            const idNow = Date.now();
            const idDeli = 'BAUEHTAttr';
            // $('*').each(function (i, elt) {
            const arr_elt_All = document.querySelectorAll('*');
            for (const elt of arr_elt_All) {
              for (const attr_curr of [...elt.attributes]) {
                const attr_curr_val = attr_curr.value;
                if (/&|<|>|"|'|`/g.test(attr_curr_val)) {
                  attr_curr.value = attr_curr.value.replaceAll(/&/g, idDeli + 'amp' + idNow);
                  attr_curr.value = attr_curr.value.replaceAll(/>/g, idDeli + 'gt' + idNow);
                  attr_curr.value = attr_curr.value.replaceAll(/</g, idDeli + 'lt' + idNow);
                  attr_curr.value = attr_curr.value.replaceAll(/"/g, idDeli + 'quot' + idNow);
                  attr_curr.value = attr_curr.value.replaceAll(/'/g, idDeli + 'apos' + idNow);
                  attr_curr.value = attr_curr.value.replaceAll(/`/g, idDeli + 'grave' + idNow);
                }
              }
            }
            const html_HtaHe_EscapeCustomly = document.body.outerHTML; // << change `_document` back to `document` if you are not using JSDOM

            // repeat (because it is impossible to revert the HtmlEntity in TagAttr back to the original escaped status while you are modifying the actual TagAttr inside the DOM)

            // unescape
            const html_HtaHe_encode_HtmlEntity_in_TagAttr = html_HtaHe_EscapeCustomly.replaceAll(new RegExp(idDeli + '(?<main>amp|gt|lt|quot|apos|grave)' + idNow, 'g'), '&$<main>;');
            return html_HtaHe_encode_HtmlEntity_in_TagAttr;
          }

          const html_HtaHe_encode_HtmlEntity_in_TagAttr = encode_HtmlEntity_in_TagAttr(); // Escape HtmlEntity back

          // #>>> compare result 
          // fs.writeFileSync(pathStr_outFile_1, html_HtaHe_ori);
          // fs.writeFileSync(pathStr_outFile_1, html_HtaHe_BrowserAutoUnesc);
          // fs.writeFileSync(pathStr_outFile_2, html_HtaHe_encode_HtmlEntity_in_TagAttr);

          const html_HtaHe_BrowserAutoUnesc_Result = /* html */ `
            <body>
              <div class="sect3" title="MIN > 0, TIME == 0 (blocking read)"> AAA </div>
              <div class="sect3" title="MIN <>>A> > &quot;TT&quot; &amp; 0'''''sssss "> BBB </div>
            </body>
          `;
          expect(html_HtaHe_BrowserAutoUnesc.replaceAll(/\s{2,}/g, '')).toEqual(html_HtaHe_BrowserAutoUnesc_Result.replaceAll(/\s{2,}/g, ''));
          
          const html_HtaHe_encode_HtmlEntity_in_TagAttr_ExpectedResult = /* html */ `
            <body>
              <div class="sect3" title="MIN &gt; 0, TIME == 0 (blocking read)"> AAA </div>
              <div class="sect3" title="MIN &lt;&gt;&gt;A&gt; &gt; &quot;TT&quot; &amp; 0&apos;&apos;&apos;&apos;&apos;sssss "> BBB </div>
            </body>
          `;
          expect(html_HtaHe_encode_HtmlEntity_in_TagAttr.replaceAll(/\s{2,}/g, '')).toEqual(html_HtaHe_encode_HtmlEntity_in_TagAttr_ExpectedResult.replaceAll(/\s{2,}/g, ''));
        });
Nor.Z