Displaying special characters, HTML entities, unicode as is in rendered HTML

Question

I am receiving annotated JSON from the backend, which I need to display in the UI.

The JSON contains strings tagged by their position and length in the content.

The content may contain characters such as \t and \n, runs of extra whitespace, HTML entities, Unicode escapes, etc. When I display it in HTML, this information is lost: HTML entities are converted to their respective characters, runs of whitespace are collapsed to a single space, and Unicode escapes are converted to the corresponding characters.

I want to display the content as is, because I need to highlight the annotations and I also let the user tag things. If the user tags text in the displayed HTML, the position and length will differ from those in the original JSON.

Example:

json:

{
"content": " \tHi there &nbsp how are you?"
}

This is displayed as "Hi there", and so if I want to highlight 'how', which is tagged at position 17, in the UI I would find it at position 10 or 11.

Also, if a user wants to tag 'are', it would be tagged at 14, while the server would expect it at 21.
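
To make the mismatch concrete, here is a small sketch that compares offsets in the raw string with offsets in an approximation of what the browser shows. The `approximateRendering` helper is hypothetical and only imitates entity decoding and whitespace collapsing for this one example; real rendering depends on the surrounding markup and CSS.

```javascript
const raw = " \tHi there &nbsp how are you?";

// Offsets in the raw string, as the backend counts them.
console.log(raw.indexOf("how")); // → 17
console.log(raw.indexOf("are")); // → 21

// Hypothetical, simplified imitation of default HTML rendering:
// decode &nbsp, collapse whitespace runs, drop the leading space.
function approximateRendering(s) {
  return s
    .replace(/&nbsp;?/g, "\u00A0")
    .replace(/[ \t\r\n]+/g, " ")
    .replace(/^ /, "");
}

const shown = approximateRendering(raw);
// Both offsets come out smaller than the raw ones,
// so they no longer line up with the annotations.
console.log(shown.indexOf("how"), shown.indexOf("are"));
```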

EDIT:

This is what I have so far:

1) All HTML entities are converted as:

&gt; --> &amp;gt; , so that it is displayed as &gt; in the rendered HTML and not as >

2) \t, \r and \n are converted as:

\t --> \\t , so that it is displayed as \t

3) I can also recognize Unicode escapes and convert them:

\u --> \\u , so that they are displayed as is

But there are other issues, such as extra whitespace, foreign characters, and patterns like \x. I don't think I have a comprehensive list of everything, and sooner or later it might break.
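
One way to avoid maintaining an ever-growing list is to walk the string once and record, for every displayed character, which raw character it came from. The sketch below assumes that the backend counts offsets the same way JavaScript does (UTF-16 code units) and that only control characters and backslashes need a visible form. `toVisible` and `rawIndex` are hypothetical names, not part of any library.

```javascript
// Hypothetical helper: returns the visible text plus, for every display
// character, the index of the raw character that produced it.
function toVisible(raw) {
  // A literal backslash is doubled so it cannot be confused with an escape.
  const named = { "\t": "\\t", "\n": "\\n", "\r": "\\r", "\\": "\\\\" };
  let text = "";
  const rawIndex = [];
  for (let i = 0; i < raw.length; i++) {
    const ch = raw[i];
    const code = raw.charCodeAt(i);
    let shown = ch; // foreign characters and entities are kept untouched
    if (named[ch]) {
      shown = named[ch];
    } else if (code < 0x20 || code === 0x7f) {
      // Any other control character is shown as \xHH.
      shown = "\\x" + code.toString(16).toUpperCase().padStart(2, "0");
    }
    for (let k = 0; k < shown.length; k++) rawIndex.push(i);
    text += shown;
  }
  return { text, rawIndex };
}

const visible = toVisible(" \tHi there &nbsp how are you?");
const displayPos = visible.text.indexOf("are");
console.log(displayPos, visible.rawIndex[displayPos]); // → 22 21
```

A selection made in the displayed text can then be translated back through `rawIndex` before it is sent to the server.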

I think this will help you: [link](http://stackoverflow.com/questions/4253367/how-to-escape-a-json-string-containing-newline-characters-using-javascript). – Maxim Goncharuk, Dec 02 '15 at 11:46

score 2 · Answer 1 · answered Dec 03 '15 at 15:02

That’s exactly what jsesc does. From the README:

jsesc is a JavaScript library for escaping JavaScript strings while generating the shortest possible valid ASCII-only output. Here’s an online demo.

Use it as follows:

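// In Node.js, load the library first: var jsesc = require('jsesc');
// In a browser, include jsesc.js with a <script> tag instead.
var data = { "content": " \tHi there &nbsp how are you?"};
var escaped = jsesc(data.content);
// → ' \\tHi there &nbsp how are you?'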

There are many options to customize the output. See the documentation for more details.

To display the jsesc output in HTML, don’t set it to an element’s .innerHTML but rather use .textContent.
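
If highlights are needed as well, plain `.textContent` is not enough on its own. One possible approach, sketched below and independent of jsesc, is to cut the raw string at the annotation offsets, escape each piece separately, and wrap the annotated pieces in `<mark>`. The `escapeHtml` and `highlight` functions and the `{ start, length }` annotation shape are assumptions for illustration; the annotations are assumed to be sorted and non-overlapping.

```javascript
// Escape only the characters that the HTML parser would interpret.
function escapeHtml(s) {
  return s.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// annotations: [{ start, length }], sorted by start and non-overlapping.
function highlight(raw, annotations) {
  let html = "";
  let pos = 0;
  for (const { start, length } of annotations) {
    html += escapeHtml(raw.slice(pos, start)); // text before the annotation
    html += "<mark>" + escapeHtml(raw.slice(start, start + length)) + "</mark>";
    pos = start + length;
  }
  return html + escapeHtml(raw.slice(pos)); // text after the last annotation
}

const content = " \tHi there &nbsp how are you?";
console.log(highlight(content, [{ start: 17, length: 3 }]));
```

Because the slicing happens on the raw string, the offsets from the backend are used unchanged. The result could be assigned to `.innerHTML` of an element styled with `white-space: pre-wrap`, which keeps runs of whitespace from being collapsed.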

answered Dec 03 '15 at 15:02

Mathias Bynens


Haven't looked into it yet, but another issue is that I have to highlight some words and that would require HTML, and I think textContent would not be able to display that – gaurav5430 Dec 03 '15 at 15:16
Take a look at jsesc; as @Mathias said, it does what you need. – ElChiniNet Dec 03 '15 at 16:51
@elchininet, actually this library would escape things and convert them to Unicode escapes; for example, a non-printable character or a foreign-language word would be converted to Unicode escapes, which would again mess up my positions and offsets – gaurav5430 Dec 03 '15 at 17:38
The library has some options; if you do not treat the output as a JavaScript string body, the library escapes all "\" and "&" characters. Test it with an example string that you receive from the backend. – ElChiniNet Dec 03 '15 at 17:44
Yeah, but did you see how the foreign characters were converted to Unicode escapes? Also, I can't set it as textContent because I need to have some HTML as well – gaurav5430 Dec 03 '15 at 17:50
You are right, I tested with Spanish characters and saw the result :( – ElChiniNet Dec 03 '15 at 18:04

ElChiniNet · Answer 2 · 2015-12-03T01:17:40.093

0

Try this little function, and add more regular expressions depending on the characters you receive:

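function html(str){
    // Escape "&" first so entities such as &nbsp are shown literally, then "<" so stray markup is not parsed.
    return str.replace(/&/g, "&amp;").replace(/</g, "&lt;")
        // Show tabs, newlines and carriage returns as visible escape sequences.
        .replace(/\t/g, "\\t").replace(/\n/g, "\\n").replace(/\r/g, "\\r");
}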

edited Dec 03 '15 at 01:17

answered Dec 03 '15 at 01:10

ElChiniNet


This is what I am presently doing to escape/show HTML entities, but there are other things like Unicode and \t etc, and I don't know all the possibilities – gaurav5430 Dec 03 '15 at 15:18
Unicode characters are represented by "\uXXXX"; add "\u" to the regular expressions. In my solution I already added "\t", "\n" and "\r"; there are not many more possibilities. – ElChiniNet Dec 03 '15 at 16:28
Could you provide some example data so I can see all the situations? I'm sure it can be accomplished with regular expression replacements (the extra whitespace can be handled with regular expressions too) – ElChiniNet Dec 03 '15 at 18:16
The input is a set of a million documents, each from diverse sources: the web as well as docs, PDFs and whatnot. Initially I decided to add functions as and when I encountered a new issue, but that is actually taking a lot of effort and stalling development. – gaurav5430 Dec 03 '15 at 18:20
OK, I understand that. But with regexps you can match many patterns, and I think this can be solved. For example, with the single replacement of "&" with "&amp;" you handle &lt; &gt; &amp; &excl; &quest; etc.; with the replacement of escaped "%u" with "\\u" you solve all Unicode characters; the rest will be octal characters, new lines, returns and tabs, which are formed with "\" at the beginning. If you provide more examples of the data, maybe we can arrive at a good solution; I am only trying to help you find one. Forgive me for my very bad English. – ElChiniNet Dec 03 '15 at 18:34
Yeah, got your point, but I asked this question to do away with the task of finding more examples. If I find them, I would write a regex or a function to escape them correctly. The issue is that I would not need to think about writing the functions/regexes if there were an easier way to handle everything as is – gaurav5430 Dec 03 '15 at 18:40
