31

I have a form on my page where the user can type some text and submit it. The text is then sent to the server (a REST API on top of node.js) and saved to the DB (postgres).

The problem is that some strange characters (control characters) are occasionally saved to the DB, for example the escape control character (^[) or the backspace control character (^H). Generally this does not break anything, since those characters are invisible and the HTML renders correctly. However, when I provide XML content for RSS readers, the readers return "Malformed XML" because of those control characters (it works after deleting them).

My question is: how can I remove those characters from a string at the client level (javascript) or the server level (javascript/node.js)?

Rory O'Kane
user606521
  • 1
    by... just doing that? Take the string, use the string replace function to replace any illegal character (or character range) with '', and then save that instead. – Mike 'Pomax' Kamermans Nov 04 '14 at 17:39
  • Check this topic http://stackoverflow.com/questions/4374822/javascript-regexp-remove-all-special-characters – Asik Nov 04 '14 at 17:40
  • use CDATA to wrap such data – Vasiliy vvscode Vanchuk Nov 04 '14 at 17:42
  • All my string fields in RSS feed are wrapped by CDATA and this does not solve the problem - still RSS readers return "malformed XML" error. – user606521 Nov 05 '14 at 09:33
  • 1
    I don't think this covers all possible characters that would break things. For example 0x200B is a silent killer - see here http://stackoverflow.com/questions/12719859/no-visible-cause-for-unexpected-token-illegal – mike nelson Dec 20 '16 at 19:53
  • 1
    Here is a list of all space characters that could be replaced by a normal space https://www.cs.tut.fi/~jkorpela/chars/spaces.html and also notes the two invisible space chars that should be removed – mike nelson Dec 20 '16 at 19:55

2 Answers

47

Control characters in Unicode are at codepoints U+0000 through U+001F and U+007F through U+009F. Use a RegExp to find those control characters and replace them with an empty string:

str.replace(/[\u0000-\u001F\u007F-\u009F]/g, "")

If you want to remove additional characters, add the characters to the character class inside the RegExp. For example, to remove U+200B ZERO WIDTH SPACE as well, add \u200B before the ].
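As a minimal sketch of the replacement above wrapped in a reusable helper: the function name `stripControlChars` and the `keepWhitespace` option are hypothetical, not part of any library. The optional mode keeps tab, line feed, and carriage return, which XML 1.0 permits and which multi-line text usually needs, and may be the better fit for the RSS case.

```javascript
// C0 controls (U+0000-U+001F), DEL (U+007F), and C1 controls (U+0080-U+009F).
const ALL_CONTROLS = /[\u0000-\u001F\u007F-\u009F]/g;

// Same ranges, but leaving tab (U+0009), line feed (U+000A),
// and carriage return (U+000D) in place.
const CONTROLS_EXCEPT_WHITESPACE =
  /[\u0000-\u0008\u000B\u000C\u000E-\u001F\u007F-\u009F]/g;

// Hypothetical helper: returns a copy of `str` with control characters removed.
function stripControlChars(str, { keepWhitespace = false } = {}) {
  const pattern = keepWhitespace ? CONTROLS_EXCEPT_WHITESPACE : ALL_CONTROLS;
  return str.replace(pattern, "");
}

// Example: escape (U+001B) and backspace (U+0008) are dropped,
// while the line break survives when keepWhitespace is set.
const cleaned = stripControlChars("first\u001B line\nsecond\u0008 line", {
  keepWhitespace: true,
});
console.log(JSON.stringify(cleaned));
```

The same function runs unchanged in the browser and in node.js, so it could be applied before submitting the form, before saving to the DB, or both.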

Rory O'Kane
  • this is a 'cure' alias 'medicine' - how about "prevention"? – Bekim Bacaj Jan 28 '21 at 02:44
  • 4
    @BekimBacaj Please tell this to people who copy paste texts from Microsoft Word for instance :) – iwanuschka Jul 01 '21 at 13:37
  • I've found that iOS autocomplete has an issue where the `right-to-left mark` Unicode character gets inserted for multilingual users (hex code `0x200F`, HTML entity `&rlm;`). – bnns Nov 27 '22 at 11:02
  • 2
    I tried to be a little more comprehensive by using: `str.replace(/[\u0000-\u001F\u007F-\u009F\u061C\u200E\u200F\u202A-\u202E\u2066-\u2069]/g, "");` – bnns Nov 27 '22 at 11:49
-6

I had a similar problem; here is the solution I chose.

I encoded the user's string data with encodeURIComponent(variable_Name) before saving it, and decoded it with decodeURIComponent(variable_Name) when displaying it.

Mateen
  • 4
    This does not work because `encodeURI..` just encodes control characters and `decodeURI..` decodes them back – user606521 Nov 05 '14 at 09:40
  • Thanks for your comment and can you please explain, why wouldn't encoding and decoding work? – Mateen Nov 06 '14 at 19:34
  • 3
    Because it just ENCODES invisible characters and then DECODES them again, so nothing actually changes. I will still have those invisible characters in my content, and I want to REMOVE them from the content... – user606521 Nov 07 '14 at 09:32
  • No, in fact the encodeURIComponent method encodes almost every symbol to its percent-encoded equivalent. For example, var uri = "@#$%^&*()_+-={}[]\|:;'<>?,./"; var res = encodeURIComponent(uri); outputs: %40%23%24%25%5E%26*()_%2B-%3D%7B%7D%5B%5D%7C%3A%3B'%3C%3E%3F%2C.%2F so the special symbols, once encoded, don't cause any problems, and we can see the actual value after decoding it. – Mateen Nov 07 '14 at 17:35
  • 1
    But I want to remove those characters, not encode them (I don't want "escape" or "backspace" characters in for example blog post description). And I can't serve encoded content to RSS feed because I have some html there and `encodeURIComponent` encodes it and I see html tags in RSS feed. – user606521 Nov 09 '14 at 09:56
  • Anyway, I'm happy you found the answer. As for the statement above, I would say this: if you encode them and then decode them (this can also be done on the server side), the user's value will be exactly as they provided it. Encoding and decoding has always saved me from the overhead of special characters. – Mateen Nov 09 '14 at 12:39
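
The point made in the comments above, that an encode/decode round trip leaves control characters in place, can be checked with a short sketch. The sample string is made up for illustration; `encodeURIComponent` and `decodeURIComponent` are the standard built-ins.

```javascript
// Sample input containing escape (U+001B) and backspace (U+0008).
const input = "Hello\u001B \u0008world";

// Encoding only changes the representation of the control characters...
const encoded = encodeURIComponent(input);

// ...and decoding restores them exactly, so nothing has been removed.
const roundTripped = decodeURIComponent(encoded);

console.log(encoded);
console.log(roundTripped === input);
console.log(/[\u0000-\u001F\u007F-\u009F]/.test(roundTripped));
```

Both checks log `true`: the decoded string is identical to the input and still contains control characters, so the round trip is lossless rather than a cleanup step.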