Filtering out all non-alphanumeric characters in JavaScript

Question

I'm trying to filter out Unicode characters that aren't related to language from a string.

Here's an example of what I want:

const filt1 = "This will not be replaced: æ Ç ü"; // This will not be replaced: æ Ç ü
const filt2 = "This will be replaced: » ↕ ◄"; // This will be replaced:

How would I go about doing this? Characters such as accented letters and Chinese characters are what I want to keep. Arrows, blocks, emoji, etc. should be filtered out.

I've found various regex filters online, but none do exactly what I want. This one works the best, but it's bulky and does not include non-accented alphanumeric characters.

((?![a-zA-ZàèìòùÀÈÌÒÙáéíóúýÁÉÍÓÚÝâêîôûÂÊÎÔÛãñõÃÑÕäëïöüÿÄËÏÖÜŸçÇßØøÅåÆæœ ]).)*

I don't think there's any algorithm that determines what you do and don't want to keep so the only way will be brute force to list what you want to keep in a giant string/array. You can examine code pages from dozens of languages and see if you can find any algorithm based on the character code, but unless you limit yourself to only a few languages, I doubt you're going to find an algorithmic shortcut. — jfriend00, Sep 29 '19 at 20:59
That was my original idea, but it looks so bulky. Easily doable as seen above, but doesn't feel efficient. — Encode42, Sep 29 '19 at 21:00
Did you examine all the code pages you care about and see if the characters you want to keep follow some pattern with their character code? That's the only possibility I see. But, if you're going into things like Chinese and not just romance languages, that's unlikely to work. — jfriend00, Sep 29 '19 at 21:02
@jfriend00 Even just including Cyrillic starts to make it a major pain, adding Chinese, Korean, Japanese, etc is going to be unmaintainable. — VLAZ, Sep 29 '19 at 21:04
@VLAZ - Yep, that's what I thought. I think I'd go back to what the real problem is and look for a different approach. — jfriend00, Sep 29 '19 at 21:04
Possible duplicate of [Javascript + Unicode regexes](https://stackoverflow.com/questions/280712/javascript-unicode-regexes) — Ilmari Karonen, Sep 29 '19 at 21:13

score 4 · Accepted Answer · edited Sep 30 '19 at 06:34

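You could try a Unicode regex, /[^\p{L}\s]/ugi, which matches every character that is neither a letter (\p{L}) nor whitespace (\s), so replacing the matches with an empty string removes them: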

console.log('This will be replaced: » ↕ ◄, This will not be replaced: æ Ç ü'.replace(/[^\p{L}\s]/ugi, ''));
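Since the title asks about alphanumeric characters, note that \p{L} on its own does not cover digits. The following is a minimal sketch, not part of the original answer, that assumes you also want to keep decimal digits (\p{Nd}) and combining marks (\p{M}, so that decomposed accented letters survive). The function name keepLettersAndDigits is hypothetical. Like the pattern above, it also strips punctuation such as ":".

```javascript
// Sketch: keep letters (\p{L}), combining marks (\p{M}), decimal digits (\p{Nd})
// and whitespace (\s); remove everything else.
// keepLettersAndDigits is a hypothetical helper name used for illustration.
function keepLettersAndDigits(input) {
  // The u flag is required for \p{...}; the g flag replaces every match.
  return input.replace(/[^\p{L}\p{M}\p{Nd}\s]/gu, '');
}

console.log(keepLettersAndDigits('Room 42: æ Ç ü » ↕ ◄ 漢字'));
```

The i flag is left out here because case-insensitivity makes no difference to a negated class built from these categories.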

Unicode property escapes were added in ES2018. Browser support is currently limited; Node.js supports them from version 10 onwards.
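Because support varies, it may be worth checking for it at runtime. This is a rough sketch under stated assumptions: supportsPropertyEscapes and stripNonLetters are hypothetical names, and returning the input unchanged when the feature is missing is just one possible fallback.

```javascript
// Sketch: detect Unicode property escape support before using the pattern.
// supportsPropertyEscapes and stripNonLetters are hypothetical names.
function supportsPropertyEscapes() {
  try {
    // Building the regex from a string avoids a parse-time SyntaxError
    // in engines that do not understand \p{...}.
    new RegExp('\\p{L}', 'u');
    return true;
  } catch (e) {
    return false;
  }
}

function stripNonLetters(input) {
  if (!supportsPropertyEscapes()) {
    // Assumption: leave the input untouched when the feature is unavailable.
    return input;
  }
  // Same pattern as in the answer, built at runtime so older engines can still parse this file.
  return input.replace(new RegExp('[^\\p{L}\\s]', 'gu'), '');
}

console.log(stripNonLetters('This will be replaced: » ↕ ◄'));
```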

edited Sep 30 '19 at 06:34

georg

answered Sep 29 '19 at 21:25

baao

can you explain a bit more what the regex does or maybe a link where to read more – joyBlanks Sep 29 '19 at 21:27
@Edude42: I'd recommend reading the MDN page linked from the answer, but it seems to still be a work in progress. The Wikipedia page on [Unicode character properties](https://en.wikipedia.org/wiki/Unicode_character_property) might be a useful supplement for some of the missing info there. – Ilmari Karonen Oct 01 '19 at 17:31
