What regex matches invalid characters, but NOT accented characters (like the Spanish ones)? (non UTF-8 / non-ASCII)

Question

I have tried all the ones suggested here: Remove non-ascii character in string (among others) in a .replace(regex, "") JavaScript function, but they either remove characters with accents too or leave other invalid non-UTF-8 characters.

Use case: I'm fetching XML data from a Spanish client, and some items contain invalid non-UTF-8 characters that make the XML unparsable (even making the GET request through Chrome gives me "Input is not proper UTF-8"). I want to keep all characters with Spanish accents ("tildes"/"acentos").

Example of invalid symbols mixed with accented and non-accented letters (for testing):

muérdago áéíóúàèìòù F̸̡̢͓̳̜̪̟̳̠̻̖͐̂̍̅̔̂͋͂͐l̸̢̹̣̤̙͚̱͓̖̹̻̣͇͗͂̃̈͝a̸̢̡̬͕͕̰̖͍̮̪̬̍̏̎̕͘ͅv̸̢̛̠̟̄̿��

T.J. Crowder · Answer 1 · 2020-04-01T15:54:50.370

The answers to the question you linked use negated character classes. The classes list the characters to keep, so just add the other characters you want to keep to those classes. (Spanish uses relatively few diacritical marks, so it's basically ñ and a few accented vowels.)

For instance, using this answer as a starting point:

str.replace(/[^\x00-\x7F]/g, "");

and adding ñ, á, é, í, ó, ú gives:

str.replace(/[^\x00-\x7Fñáéíóú]/g, "");
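Spanish text also commonly contains ü (as in "pingüino"), uppercase accented letters, and the inverted marks ¿ and ¡. Here is a minimal sketch of an extended class, assuming you want to keep those too. The names `NOT_SPANISH` and `stripNonSpanish` are made up for illustration and don't come from the linked answers:

```javascript
// Hypothetical helper: keep ASCII plus the precomposed Spanish letters and
// inverted punctuation, and remove every other UTF-16 code unit.
const NOT_SPANISH = /[^\x00-\x7FñÑáéíóúÁÉÍÓÚüÜ¿¡]/g;

function stripNonSpanish(str) {
    return str.replace(NOT_SPANISH, "");
}

console.log(stripNonSpanish("¿Dónde está el pingüino? \u2603 ¡AQUÍ!"));
```

Because the regex has no `u` flag, it works on UTF-16 code units, so both halves of a surrogate pair (an emoji, for example) fall outside the class and are removed together.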

If you have it available on your target system, you may want to normalize the string to NFC form first, so that if accents are written with combining marks (rather than the single code point for an accented letter), those get handled:

if (str.normalize) {
    // normalize() defaults to NFC, which composes "i" + U+0301 into "í"
    str = str.normalize();
}
str = str.replace(/[^\x00-\x7Fñáéíóú]/g, "");

Otherwise, you might want to allow for the combining accents. That would complicate the regular expression.

Here's an example of a string with a combining acute accent and what the regex above does to it without and with normalization:

if (!String.prototype.normalize) {
    console.log("This host doesn't support the normalize method");
} else {
    const str = "Buenos di\u0301as";
    console.log("string:", str);
    console.log(
        "without normalization:",
        str.replace(/[^\x00-\x7Fñáéíóú]/g, "")
    );
    console.log(
        "with normalization:   ",
        str.normalize().replace(/[^\x00-\x7Fñáéíóú]/g, "")
    );
}

Notice how in the "without" case, the combining mark was removed, so "días" came out misspelled as "dias".
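The two steps can be wrapped in one function. This is only a sketch under the same assumptions as above: `cleanSpanishText` is a made-up name, and the class is the lowercase-only one from this answer.

```javascript
// Hypothetical helper combining normalization and stripping.
function cleanSpanishText(str) {
    // Compose base letter + combining mark into a single code point where one
    // exists (NFC); fall back to the raw string on hosts without normalize().
    const composed = typeof str.normalize === "function"
        ? str.normalize("NFC")
        : str;
    // Remove everything that is not ASCII or one of the listed letters.
    return composed.replace(/[^\x00-\x7Fñáéíóú]/g, "");
}

console.log(cleanSpanishText("Buenos di\u0301as"));
console.log(cleanSpanishText("Fl\u0338a\uFFFD"));
```

One caveat: NFC composes any base letter and mark that have a precomposed form, not just the Spanish ones. For example, "a" followed by U+030F becomes ȁ, which the regex then removes along with its base letter. If that matters for your data, you could strip the unwanted combining marks before normalizing.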
