65

I need a way to identify certain strings in HTML markup. I know what the strings are, but it is possible that they could be substrings of other strings in the document. To find them, I output a special delimiter character (currently using \032). On page load, we go through the HTML and record the location of the strings, and remove the delimiter.

Unfortunately, most browsers show the delimiter character until we can find and remove them all. I'd like to avoid that if possible. Is there a character or string that will be preserved in the HTML content (so a comment won't work) but won't be visible to the user? It also needs to be something that is fairly unlikely to appear next to a string, so something like &nbsp; wouldn't work either.

EDIT: Sorry, I forgot to mention that the strings will be in attributes, so any sort of tag won't work.

noah

4 Answers

160

&zwnj; - zero-width non-joiner (see http://htmlhelp.org/reference/html40/entities/special.html)

On the off chance that this already appears in your text, double it up (e.g. &zwnj;&zwnj;mytext&zwnj;&zwnj;).
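
As a rough sketch of the doubling idea, the markers could be produced like this. The names `ZWNJ`, `MARKER`, `mark`, and `isDelimiterSafe` are hypothetical helpers made up for illustration; only the character U+200C (what &zwnj; decodes to) comes from the answer.

```javascript
// U+200C is the character that the &zwnj; entity decodes to.
const ZWNJ = "\u200C";
// Doubled, so a lone zero-width non-joiner already in the text is not mistaken for a marker.
const MARKER = ZWNJ + ZWNJ;

// Wrap a tracked string in markers so it can be located later.
// (Hypothetical helper, not part of any library.)
function mark(text) {
  return MARKER + text + MARKER;
}

// Report whether a source string is free of the doubled marker.
// If it is not, the markers would be ambiguous and another delimiter is needed.
function isDelimiterSafe(source) {
  return !source.includes(MARKER);
}
```

Calling `isDelimiterSafe` on the unmarked output first is one way to confirm that the doubled character really is unused before relying on it.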


Edit in response to comment: works in Firefox 3. Note that you have to search for the Unicode value of the entity.

<html>
<body>
    <div id="test">
        This is a &zwnj;test
    </div>

    <script type="application/javascript">
        var myDiv = document.getElementById("test");
        var content = myDiv.innerHTML;
        var pos = content.indexOf("\u200C");
        alert(pos);
    </script>
</body>
</html>
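
Building on the lookup above, here is a minimal sketch of the "record the locations, then remove the delimiter" step from the question. It assumes the doubled zero-width non-joiner is used as both the opening and closing marker; `DELIMITER` and `extractMarked` are hypothetical names, and the function works on a plain string so it is independent of the DOM.

```javascript
// Assumed delimiter: two zero-width non-joiners (U+200C) in a row.
const DELIMITER = "\u200C\u200C";

// Returns the text with all delimiters removed, plus the offsets
// (relative to the cleaned text) of every string that was delimited.
function extractMarked(text) {
  const parts = text.split(DELIMITER);
  const ranges = [];
  let clean = "";
  parts.forEach((part, i) => {
    // Odd-indexed parts sit between an opening and a closing delimiter.
    // The last part is skipped because it has no closing delimiter after it.
    if (i % 2 === 1 && i < parts.length - 1) {
      ranges.push({ start: clean.length, end: clean.length + part.length, value: part });
    }
    clean += part;
  });
  return { clean, ranges };
}
```

In a page, this could be applied to the result of `element.getAttribute(name)` or to a text node's `nodeValue`, with `clean` written back afterwards; that wiring is left out here.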
Anon
  • Thanks for this... I just used it in a case where I had strings with long words with slashes joining things. I wanted to 'suggest' to the browser that it break lines at the slashes, so I inserted myString.replace("/", "/\u200c"). – Malcolm Dwyer Oct 04 '13 at 15:00
5

You could insert them into <span> elements. This will work only for in-page text (not attributes, or the like).

Otherwise, you could insert a whitespace character that your program doesn't already output as part of the HTML, like a tab character (\x09), a vertical tab (\x0b), a bare carriage return (\x0d; without a newline beside it, à la Windows text encoding), or just a null byte (\x00).

amphetamachine
  • Windows never used carriage return without a new line after it; it always uses both in succession. You're thinking of old Macs. – Michael Madsen May 11 '10 at 21:24
  • So the problem with whitespace characters is that the DOM will normalize and otherwise mess with them, so they can't be reliably found later. VTs tend to get converted to spaces in the DOM. – noah May 12 '10 at 14:10
  • @Michael Madsen - That's what I meant; as `foo\x{0d}\x{0a}bar` is the Windows-standard line formatting method and would not match `/\x0d(?[^\x0a]*)\x0d`. Kudos on recalling the old Mac encoding! Ever tried `type`-ing a file in that encoding on a Windows terminal? Prints all on one line! :-) – amphetamachine May 12 '10 at 17:48
4

The best thing to insert, which is not visible in the browser, would be a pair of tags with a special id, such as <span id="delimiter" class="Delimiter"></span>. It will not show up in the rendered content, but it remains present in the document. You don't need to remove it.

Kangkan
  • Sorry, forgot to mention that the strings appear in attributes too, so the tags will end up encoded. – noah May 11 '10 at 20:27
0

You could use left-to-right (LTR) marks. Is this for some sort of XSS testing? If so, this might be of interest: Taint support for PHP

dimo414
Tgr
  • They mark left-to-right writing direction in Unicode. They have no effect when the language is left-to-right anyway. – Tgr May 12 '10 at 16:00