
I need to extract an HTML substring with JS, and the extraction is position-dependent. I store special characters HTML-encoded.

For example:

HTML

<div id="test"><p>l&ouml;sen &amp; gr&uuml;&szlig;en</p></div>

Text

lösen & grüßen

My problem lies in the JS part, for example when I try to extract the fragment l&ouml;, which has the HTML-dependent start position of 3 and end position of 9 inside the <div> block. JS seems to convert some special characters internally, so the range from 3 to 9 is wrongly interpreted as "lösen " and not "l&ouml;". Other special characters like &amp; are not affected by this.

So my question is: does anyone know why JS behaves this way? Characters like &auml; or &ouml; are converted, while characters like &amp; or &nbsp; are left as they are. Is there any way to avoid this conversion?

I've set up a fiddle to demonstrate this: JSFiddle

Thanks for any help!

EDIT:

Maybe my explanation was a bit confusing, sorry for that. What I want is the HTML:

<p>l&ouml;sen &amp; gr&uuml;&szlig;en</p> .

Every special character should stay encoded; only the HTML tags should remain as tags, as in the HTML above.

But JS automatically converts &ouml; or &uuml; into ö or ü, which I need to avoid.

noplacetoh1de
  • How are you getting the `lö` fragment? `substr()` seems to work fine: http://jsfiddle.net/66zyK/2/ – Rory McCrossan Nov 22 '12 at 14:00
  • So you also need to retrieve e.g. the inner `<p>` tags? – Roko C. Buljan Nov 22 '12 at 14:06
  • I need the HTML-Tags as they are, this gives me the formatting of my text. But special characters in the text should be as I store them, HTML-encoded. – noplacetoh1de Nov 22 '12 at 14:09
  • The browser's HTML parser decodes the entities when it constructs the DOM. The original HTML is lost. When you hit `innerHTML` you get a new serialisation of the DOM to HTML. – Quentin Nov 22 '12 at 14:15
  • Why do you HTML-encode German umlauts? Unicode covers all those characters. – Šime Vidas Nov 22 '12 at 14:18
  • @Quentin: Link or it didn't happen... ;) – Robert Koritnik Nov 22 '12 at 14:18
  • @RobertKoritnik OP's original demo demonstrates this behavior. Here is a simplified version: http://jsfiddle.net/66zyK/3/ As you can see, `.innerHTML` returns the characters, not the original entities. – Šime Vidas Nov 22 '12 at 14:22
  • @noplacetoh1de Btw, those `&ouml;` things are called "named character references", and are spec'd [here](http://www.whatwg.org/specs/web-apps/current-work/multipage/syntax.html#character-references). – Šime Vidas Nov 22 '12 at 14:30
  • Many thanks for the helpful information. I still have the problem that the ampersand is still HTML-encoded. http://jsfiddle.net/66zyK/5/ – noplacetoh1de Nov 22 '12 at 14:40
  • The problem is that I need everything uniformly encoded, e.g. everything as "named character references" or everything as Unicode characters, but JS still converts characters like the ampersand. – noplacetoh1de Nov 22 '12 at 14:42
  • possible duplicate of [Use javascript to get raw html code](http://stackoverflow.com/questions/3905219/use-javascript-to-get-raw-html-code) – Jukka K. Korpela Nov 22 '12 at 14:55

2 Answers


That's because the browser (and not JavaScript) turns entities that don't need to be escaped in HTML into their respective Unicode characters; entities such as &amp;, &lt; and &gt; stay escaped.

So by the time you inspect .innerHTML, it no longer contains exactly what was in the original page source; you could reverse this process, but it involves the full map of character <-> entity pairs which is just not practical.
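
If you only store a handful of entities, a partial reversal may be enough. The following is a minimal sketch, not a general solution: `NAMED_ENTITIES` and `reencodeEntities` are hypothetical names, the map is deliberately incomplete, and anything missing from it falls back to a numeric character reference. It assumes the input is a string as returned by `.innerHTML`, where &amp;, &lt; and &gt; are already escaped.

```javascript
// Hypothetical, deliberately incomplete map of character -> named reference.
// Extend it with whichever entities you actually store.
const NAMED_ENTITIES = {
  "\u00E4": "&auml;",
  "\u00F6": "&ouml;",
  "\u00FC": "&uuml;",
  "\u00C4": "&Auml;",
  "\u00D6": "&Ouml;",
  "\u00DC": "&Uuml;",
  "\u00DF": "&szlig;",
};

// Re-encode every non-ASCII code point in a serialised HTML string.
// Characters missing from the map become numeric references (e.g. &#8364;).
function reencodeEntities(html) {
  return html.replace(/[^\x00-\x7F]/gu, (ch) =>
    NAMED_ENTITIES[ch] ?? `&#${ch.codePointAt(0)};`
  );
}

// In a browser you would pass document.getElementById("test").innerHTML;
// the literal below stands in for that serialised string.
const serialised = "<p>lösen &amp; grüßen</p>";
const encoded = reencodeEntities(serialised);

console.log(encoded);                  // <p>l&ouml;sen &amp; gr&uuml;&szlig;en</p>
console.log(encoded.substring(3, 10)); // l&ouml;
```

Positions only line up with your stored markup if every character you store as an entity is in the map and everything else is stored unencoded; the numeric fallback changes the length of anything the map does not cover.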

Ja͢ck

If I understand you correctly, try using innerHTML, or .html('your html code') with jQuery, on the target element.

Idan Gozlan