JavaScript automatically converts some special characters

Question

I need to extract an HTML substring by position with JS. I store special characters HTML-encoded.

For example:

HTML

<div id="test"><p>l&ouml;sen &amp; gr&uuml;&szlig;en</p></div>

Text

lösen & grüßen

My problem lies in the JS part. For example, I try to extract the fragment l&ouml;, which has the HTML-dependent start position 3 and end position 9 inside the <div> block. JS seems to convert some special characters internally, so the range from 3 to 9 is wrongly interpreted as "lösen " and not "l&ouml;". Other special characters like &amp; are not affected by this.

So my question is whether someone knows why JS behaves this way. Characters like &auml; or &ouml; are converted, while characters like &amp; or &nbsp; are left as they are. Is there any way to avoid this conversion?

I've set up a fiddle to demonstrate this: JSFiddle

Thanks for any help!

EDIT:

Maybe I explained it in a confusing way, sorry for that. What I want is the HTML:

<p>l&ouml;sen &amp; gr&uuml;&szlig;en</p>

Every special character should stay encoded, and only the HTML tags should remain as they are, as in the HTML above.

But JS automatically converts &ouml; or &uuml; into ö or ü, which is what I need to avoid.

How are you getting the `lö` fragment? `substr()` seems to work fine: http://jsfiddle.net/66zyK/2/ — Rory McCrossan, Nov 22 '12 at 14:00
I need the HTML tags as they are, because they give me the formatting of my text. But special characters in the text should stay as I store them, HTML-encoded. — noplacetoh1de, Nov 22 '12 at 14:09
The browser's HTML parser decodes the entities when it constructs the DOM. The original HTML is lost. When you hit `innerHTML` you get a new serialisation of the DOM to HTML. — Quentin, Nov 22 '12 at 14:15
Why do you HTML-encode German umlauts? Unicode covers all those characters. — Šime Vidas, Nov 22 '12 at 14:18
@RobertKoritnik OP's original demo demonstrates this behavior. Here is a simplified version: http://jsfiddle.net/66zyK/3/ As you can see, `.innerHTML` returns the characters, not the original entities. — Šime Vidas, Nov 22 '12 at 14:22
@noplacetoh1de Btw, those `&ouml;` things are called "named character references", and are spec'd [here](http://www.whatwg.org/specs/web-apps/current-work/multipage/syntax.html#character-references). — Šime Vidas, Nov 22 '12 at 14:30
Many thanks for the helpful information. I still have the problem that the ampersand still seems to be HTML-encoded. http://jsfiddle.net/66zyK/5/ — noplacetoh1de, Nov 22 '12 at 14:40
The problem is that I need everything uniformly encoded, e.g. everything as named character references or everything as Unicode, but JS still converts characters like the ampersand. — noplacetoh1de, Nov 22 '12 at 14:42
possible duplicate of [Use javascript to get raw html code](http://stackoverflow.com/questions/3905219/use-javascript-to-get-raw-html-code) — Jukka K. Korpela, Nov 22 '12 at 14:55

score 2 · Accepted Answer · answered Nov 23 '12 at 17:02

That's because the browser (and not JavaScript) turns entities that don't need to be escaped in HTML into their respective Unicode characters (e.g. it skips &amp;, &lt; and &gt;).

So by the time you inspect .innerHTML, it no longer contains exactly what was in the original page source; you could reverse this process, but it involves the full map of character <-> entity pairs which is just not practical.
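If a uniform encoding is all that is needed, one possible workaround is to re-encode the serialized string yourself. The sketch below is only an illustration: `encodeNonAscii` and `NAMED` are hypothetical names, not part of any library, and the input string is an assumption about what `.innerHTML` returns for the markup in the question. Every non-ASCII character becomes a numeric character reference unless a named reference is supplied in a lookup table, so no full entity map is required. It does not recover the original page source, only a consistently encoded version of it.

```javascript
// Hypothetical lookup table covering only the characters from the question.
// Anything not listed here falls back to a numeric character reference.
const NAMED = {
  "\u00F6": "&ouml;",  // ö
  "\u00FC": "&uuml;",  // ü
  "\u00DF": "&szlig;", // ß
};

// Hypothetical helper: re-encode every non-ASCII character in a string.
// ASCII characters (including the already escaped "&amp;") are left untouched.
function encodeNonAscii(html, named = {}) {
  // Array.from iterates by code point, so characters outside the BMP stay intact.
  return Array.from(html, (ch) => {
    const codePoint = ch.codePointAt(0);
    if (codePoint <= 127) return ch;
    return named[ch] || `&#${codePoint};`;
  }).join("");
}

// Assumed result of reading .innerHTML for the <div> in the question.
const serialized = "<p>l\u00F6sen &amp; gr\u00FC\u00DFen</p>";

console.log(encodeNonAscii(serialized));
// <p>l&#246;sen &amp; gr&#252;&#223;en</p>

console.log(encodeNonAscii(serialized, NAMED));
// <p>l&ouml;sen &amp; gr&uuml;&szlig;en</p>
```

With the named table the output happens to match the stored markup for this example, so the original offsets line up again. That is not guaranteed in general, because the serialized HTML can differ from the source in other ways too.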

score 0 · Answer 2 · answered Nov 23 '12 at 16:51

If I understand you correctly, try using innerHTML, or .html('your html code') with jQuery, on the target element.

Idan Gozlan
