30

I'd like to remove all invalid UTF-8 characters from a string in JavaScript. I've tried with this JavaScript:

strTest = strTest.replace(/([\x00-\x7F]|[\xC0-\xDF][\x80-\xBF]|[\xE0-\xEF][\x80-\xBF]{2}|[\xF0-\xF7][\x80-\xBF]{3})|./g, "$1");

The UTF-8 validation regex described here (link removed) seems more complete, so I adapted it in the same way:

strTest = strTest.replace(/([\x09\x0A\x0D\x20-\x7E]|[\xC2-\xDF][\x80-\xBF]|\xE0[\xA0-\xBF][\x80-\xBF]|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}|\xED[\x80-\x9F][\x80-\xBF]|\xF0[\x90-\xBF][\x80-\xBF]{2}|[\xF1-\xF3][\x80-\xBF]{3}|\xF4[\x80-\x8F][\x80-\xBF]{2})|./g, "$1");

Both of these pieces of code seem to let valid UTF-8 through, but they filter out hardly any of the bad UTF-8 characters in my test data: UTF-8 decoder capability and stress test. The bad characters either come through unchanged or have some of their bytes removed, creating a new, invalid character.

I'm not very familiar with the UTF-8 standard or with multibyte in JavaScript so I'm not sure if I'm failing to represent proper UTF-8 in the regex or if I'm applying that regex improperly in JavaScript.

Edit: added global flag to my regex per Tomalak's comment - however this still isn't working for me. I'm abandoning doing this on the client side per bobince's comment.

halfer
Matthew Sielski
  • Missing links: link 1 - http://stackoverflow.com/questions/1401317/remove-non-uft8-characters-from-string link 2 - http://www.w3.org/International/questions/qa-forms-utf-8 – Matthew Sielski Apr 19 '10 at 19:04

8 Answers

41

I use this simple and sturdy approach:

function cleanString(input) {
    var output = "";
    for (var i=0; i<input.length; i++) {
        if (input.charCodeAt(i) <= 127) {
            output += input.charAt(i);
        }
    }
    return output;
}

Basically all you really want are the ASCII chars 0-127, so just rebuild the string char by char. If it's a good char, keep it; if not, ditch it. It's pretty robust, and if sanitization is your goal, it's fast enough (in fact it's really fast).
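For comparison, the same ASCII-only filter can be sketched as a single regex replace. The name `cleanStringRegex` is made up for this illustration; like the loop above, it drops every non-ASCII character, including perfectly valid accented letters.

```javascript
// Regex equivalent of the loop above: remove every UTF-16 code unit above 0x7F.
// cleanStringRegex is a hypothetical name used only for this sketch.
function cleanStringRegex(input) {
  return input.replace(/[^\x00-\x7F]/g, "");
}

// Accented letters are removed along with anything else outside ASCII.
console.log(cleanStringRegex("na\u00EFve caf\u00E9"));
```

Whether the loop or the regex is faster depends on the engine, so measure before choosing one for speed.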

Ali
  • 3
    output += input.charCodeAt(i) <= 127 ? input.charAt(i) : ' ' – user40521 Jan 08 '16 at 17:46
  • One-liner with ramda: `const cleanString = input => R.map(char => char.charCodeAt(0) <= 127 ? char : '', input).join('');` – Adam McCormick May 17 '17 at 18:58
  • 1
    One-liner without ramda: `const cleanString = input => Array.of(input).map(char => char.charCodeAt(0) <= 127 ? char : '', input).join('')` – docodemore Aug 18 '17 at 17:43
  • 1
    I don't believe docodemore's version works, `Array.of(input)` returns a single element array. I think you want this: `const cleanString = input => input.split('').map(char => char.charCodeAt(0) <= 127 ? char : '').join('')` – Robin Clowers Aug 18 '17 at 23:18
  • 1
    see https://stackoverflow.com/a/57593674/1955957 for french, spanish and other "latin" languages – O'Neill Aug 21 '19 at 14:10
  • This removes all non-[*ASCII*](https://en.wikipedia.org/wiki/ASCII) characters, not invalid [UTF-8](https://en.wikipedia.org/wiki/UTF-8). For example, `"ф"` is a perfectly valid UTF-8 string, but this code snippet returns `""`. Though I'm not sure how you can remove "invalid UTF-8 characters" when JavaScript stores strings in [UTF-16](https://en.wikipedia.org/wiki/UTF-16). – Boris Verkhovskiy Jul 04 '21 at 16:59
22

JavaScript strings are natively Unicode. They hold character sequences* not byte sequences, so it is impossible for one to contain an invalid byte sequence.

(Technically, they actually contain UTF-16 code unit sequences, which is not quite the same thing, but this probably isn't anything you need to worry about right now.)
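The one way a JavaScript string can be "invalid" at the code unit level is an unpaired surrogate, which has no UTF-8 representation. As a minimal sketch (the function name `fixLoneSurrogates` is hypothetical), such code units can be replaced or removed with a regex that works on code units:

```javascript
// Replace each unpaired UTF-16 surrogate with U+FFFD, or remove it by passing "".
// fixLoneSurrogates is a hypothetical helper name for this sketch.
function fixLoneSurrogates(input, replacement = "\uFFFD") {
  // First alternative: a high surrogate not followed by a low surrogate.
  // Second alternative: a low surrogate not preceded by a high surrogate.
  return input.replace(
    /[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]/g,
    replacement
  );
}

console.log(fixLoneSurrogates("a\uD800b", "")); // the lone high surrogate is removed
```

This relies on regex lookbehind, so it assumes a reasonably recent engine. Newer engines also provide `String.prototype.isWellFormed()` and `toWellFormed()` for the same purpose.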

You can, if you need to for some reason, create a string holding characters used as placeholders for bytes, i.e. using the character U+0080 ('\x80') to stand for the byte 0x80. This is what you would get if you encoded characters to bytes using UTF-8, then decoded them back to characters using ISO-8859-1 by mistake. There is a special JavaScript idiom for this:

var bytelike= unescape(encodeURIComponent(characters));

and to get back from UTF-8 pseudobytes to characters again:

var characters= decodeURIComponent(escape(bytelike));

(This is, notably, pretty much the only time the escape/unescape functions should ever be used. Their existence in any other program is almost always a bug.)

decodeURIComponent(escape(bytes)), since it behaves like a UTF-8 decoder, will raise an error if the sequence of code units fed into it would not be acceptable as UTF-8 bytes.

It is very rare for you to need to work on byte strings like this in JavaScript. Better to keep working natively in Unicode on the client side. The browser will take care of UTF-8-encoding the string on the wire (in a form submission or XMLHttpRequest).
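If you do end up holding real bytes (for example in a `Uint8Array`), a sketch of the same validation without `escape`/`unescape` is possible with the standard `TextDecoder` API. The helper names `isValidUtf8` and `decodeDroppingInvalid` are made up for this illustration, and it assumes an environment where `TextDecoder` supports the `fatal` option (current browsers and Node.js builds with ICU).

```javascript
// Returns true if the bytes are well-formed UTF-8.
// With { fatal: true } the decoder throws instead of substituting U+FFFD.
function isValidUtf8(bytes) {
  try {
    new TextDecoder("utf-8", { fatal: true }).decode(bytes);
    return true;
  } catch (e) {
    return false;
  }
}

// Lenient decode: invalid sequences become U+FFFD, which is then removed.
// Caveat: this also removes any U+FFFD that was legitimately in the data.
function decodeDroppingInvalid(bytes) {
  return new TextDecoder("utf-8").decode(bytes).replace(/\uFFFD/g, "");
}

const good = new TextEncoder().encode("h\u00E9");  // bytes 68 C3 A9
const bad = Uint8Array.from([0x68, 0xc3, 0x28]);   // C3 needs a continuation byte

console.log(isValidUtf8(good), isValidUtf8(bad));
console.log(decodeDroppingInvalid(bad));
```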

bobince
  • 1
    Thanks for an informative answer -- essentially that what I'm doing is difficult because I shouldn't be doing it. I'm having trouble with certain characters on the back-end, and need to address it there. – Matthew Sielski Apr 19 '10 at 20:20
  • The string `"\uD800"` is invalid, and will cause `encodeURIComponent` to throw. – OrangeDog Jun 07 '12 at 16:57
  • @OrangeDog: yes, as there is no UTF-8 representation of that sequence of code units. – bobince Jun 08 '12 at 14:07
  • Saying that it's impossible for a javascript string to contain an invalid byte sequence is a great theory, and it is what I would expect... however, I am currently trying to fix a node issue which is caused by a string (returned from mongodb) which contains invalid UTF8 characters. Thus, it apparently is possible after all =] – taxilian Oct 21 '15 at 17:50
  • @bobince in regards to your last line, the browser does not convert the headers values set manually with setRequestHeader and will crash lamentably when given a non utf-value. Better anticipating it ;) – Sebas Jan 09 '16 at 21:49
  • 'escape' is an obsolete function, not supported in all modern browsers. – David Spector Sep 08 '19 at 17:49
15

Languages like Spanish and French have accented characters such as "é", whose codes are in the range 160-255 (see https://www.ascii.cl/htmlcodes.htm). This version keeps those as well:

function cleanString(input) {
    var output = "";
    for (var i=0; i<input.length; i++) {
        if (input.charCodeAt(i) <= 127 || input.charCodeAt(i) >= 160 && input.charCodeAt(i) <= 255) {
            output += input.charAt(i);
        }
    }
    return output;
}
O'Neill
12

Simple mistake, big effect:

strTest = strTest.replace(/your regex here/g, "$1");
// ----------------------------------------^

Without the "global" flag, the replacement is applied to the first match only.

Side note: To remove any character that does not fulfill some kind of complex condition, like falling into a set of certain Unicode character ranges, you can use negative lookahead:

var re = /(?![\x00-\x7F]|[\xC0-\xDF][\x80-\xBF]|[\xE0-\xEF][\x80-\xBF]{2}|[\xF0-\xF7][\x80-\xBF]{3})./g;
strTest = strTest.replace(re, "")

where re reads as

(?!      # negative look-ahead: a position *not followed by*:
  […]    #   any allowed character range from above
)        # end lookahead
.        # match this character (only if previous condition is met!)
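One thing to keep in mind: byte ranges such as `[\xC0-\xDF][\x80-\xBF]` describe UTF-8 bytes, while a normal JavaScript string holds UTF-16 code units. The sketch below uses the capture-based pattern from the question to show the difference between running it on native text and on a "one character per byte" string. The helpers `toByteString` and `fromByteString` are hypothetical names, built on the standard `TextEncoder`/`TextDecoder`.

```javascript
// Capture-based pattern from the question: keep valid-looking sequences, drop the rest.
const utf8ish = /([\x00-\x7F]|[\xC0-\xDF][\x80-\xBF]|[\xE0-\xEF][\x80-\xBF]{2}|[\xF0-\xF7][\x80-\xBF]{3})|./g;

// Hypothetical helper: encode text as UTF-8 and store one character per byte.
function toByteString(text) {
  return Array.from(new TextEncoder().encode(text), (b) => String.fromCharCode(b)).join("");
}

// Hypothetical helper: turn a one-character-per-byte string back into text.
function fromByteString(bytelike) {
  return new TextDecoder().decode(Uint8Array.from(bytelike, (c) => c.charCodeAt(0)));
}

const text = "caf\u00E9 \u0444";

// On native text the non-ASCII characters match none of the byte patterns.
console.log(text.replace(utf8ish, "$1"));

// On a byte string with a stray 0xFF appended, only the stray byte is dropped.
const dirty = toByteString(text) + "\xFF";
console.log(fromByteString(dirty.replace(utf8ish, "$1")));
```

This is only meant to show where such a pattern applies. The simple ranges here still accept some sequences that strict UTF-8 forbids, such as overlong encodings.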
Tomalak
  • Thank you, that was a big flaw in my code. Unfortunately, with the global flag now in place, both of the regular expressions I posted seem to be filtering anything that's not ASCII. The "stress test" data's first test is some valid UTF-8 text which is being stripped, and if I take sample text from http://www.columbia.edu/kermit/utf8.html everything but ASCII gets removed. – Matthew Sielski Apr 19 '10 at 19:18
10

If you're trying to remove the "invalid character" (�, U+FFFD) from JavaScript strings, you can get rid of it like this:

myString = myString.replace(/\uFFFD/g, '')
Dan Mantyla
2

I ran into this problem with a really weird result from the Date Taken data of a digital image. My scenario is admittedly unusual: using Windows Script Host (WSH) and the Shell.Application ActiveX object, which lets you get the namespace object of a folder and call the GetDetailsOf function to return EXIF data after it has been parsed by the OS.

var app = new ActiveXObject("Shell.Application");
var info = app.Namespace("c:\\"); // the backslash must be escaped inside a string literal
var date = info.GetDetailsOf(info.ParseName("testimg.jpg"), 12);

In Windows Vista and 7, the result looked like this:

?8/?27/?2011 ??11:45 PM

So my approach was as follows:

var chars = date.split(''); //split into characters
var clean = "";
for (var i = 0; i < chars.length; i++) {
   if (chars[i].charCodeAt(0) < 255) clean += chars[i];
}

The result of course is a string that excludes those question mark characters.
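A narrower variant is sketched below, under the assumption (not confirmed in this answer) that the question marks are invisible Unicode directional marks, U+200E and U+200F, inserted by the shell. The function name `stripDirectionalMarks` and the sample string are made up for illustration. Removing only those two characters leaves any other non-Latin text intact.

```javascript
// Assumption: the "?" characters are U+200E (left-to-right mark) and
// U+200F (right-to-left mark). stripDirectionalMarks is a hypothetical name.
function stripDirectionalMarks(value) {
  return value.replace(/[\u200E\u200F]/g, "");
}

// Hypothetical sample modelled on the output shown above.
var sample = "\u200E8/\u200E27/\u200E2011 \u200F\u200E11:45 PM";
var cleaned = stripDirectionalMarks(sample);
```

If the stray characters turn out to be something else, logging each `charCodeAt` value will show which code points to target.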

I know you went with a different solution altogether, but I thought I'd post my solution in case anyone else is having troubles with this and cannot use a server side language approach.

Marcus Pope
0

I have combined some of the solutions proposed above into an error-safe version:

var removeNonUtf8 = (characters) => {
    try {
        // Round-trip through UTF-8 pseudobytes; ignore strings that cannot be encoded
        var bytelike = unescape(encodeURIComponent(characters));
        characters = decodeURIComponent(escape(bytelike));
    } catch (error) { }
    // remove � (U+FFFD replacement characters)
    characters = characters.replace(/\uFFFD/g, '');
    return characters;
};
loretoparisi
  • 1
    'unescape' and 'escape' should no longer be used and may not be supported in future browsers. – David Spector Sep 08 '19 at 17:51
  • Thank you, a good answer about `escape` and `unescape` replacements is here https://stackoverflow.com/a/51175973/758836 – loretoparisi Sep 09 '19 at 09:38
  • That "good answer" only refers to mail links, not to any of the questions raised here, I believe. There are substitute functions offered on StackOverflow, but none are tested, and I can't find any tested functions on the Web as yet. Unicode really is difficult to manipulate. – David Spector Sep 09 '19 at 13:08
0

I used @Ali's solution not only to clean my string, but to replace the non-ASCII chars with HTML numeric character references:

function cleanString(input) {
  var output = "";
  for (var i = 0; i < input.length; i++) {
    if (input.charCodeAt(i) <= 127) {
      output += input.charAt(i);
    } else {
      // emit a decimal reference such as &#233; for each non-ASCII code unit
      output += "&#" + input.charCodeAt(i) + ";";
    }
  }
  return output;
}
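One caveat: `charCodeAt` returns UTF-16 code units, so a character outside the Basic Multilingual Plane (an emoji, for instance) would come out as two surrogate references, which are not valid HTML character references. A sketch of a code-point-aware variant follows. The name `escapeNonAscii` is hypothetical, and it assumes an ES2015+ engine for `for...of` and `codePointAt`.

```javascript
// Code-point-aware variant: one numeric reference per character, even for
// characters stored as surrogate pairs. escapeNonAscii is a hypothetical name.
function escapeNonAscii(input) {
  let output = "";
  for (const ch of input) {        // for...of iterates by code point
    const cp = ch.codePointAt(0);
    output += cp <= 127 ? ch : "&#" + cp + ";";
  }
  return output;
}

console.log(escapeNonAscii("caf\u00E9 \uD83D\uDE00"));
```

An unpaired surrogate in the input would still be emitted as its own reference, so it may be worth removing those first.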
barbsan
Doran