I'm trying to filter out Unicode characters that aren't related to language from a string.

Here's an example of what I want:

const filt1 = "This will not be replaced: æ Ç ü"; // This will not be replaced: æ Ç ü
const filt2 = "This will be replaced: » ↕ ◄"; // This will be replaced:   

How would I go about doing this? Characters such as accented letters and Chinese characters are what I want to keep. Arrows, blocks, emoji, etc. should be filtered out.

I've found various regex filters online, but none do exactly what I want. This one works the best, but it's bulky and does not include non-accented alphanumeric characters.

((?![a-zA-ZàèìòùÀÈÌÒÙáéíóúýÁÉÍÓÚÝâêîôûÂÊÎÔÛãñõÃÑÕäëïöüÿÄËÏÖÜŸçÇߨøÅ寿œ ]).)*
Encode42
  • I don't think there's any algorithm that determines what you do and don't want to keep so the only way will be brute force to list what you want to keep in a giant string/array. You can examine code pages from dozens of languages and see if you can find any algorithm based on the character code, but unless you limit yourself to only a few languages, I doubt you're going to find an algorithmic shortcut. – jfriend00 Sep 29 '19 at 20:59
  • That was my original idea, but it looks so bulky. Easily doable as seen above, but doesn't feel efficient. – Encode42 Sep 29 '19 at 21:00
  • Did you examine all the code pages you care about and see if the characters you want to keep follow some pattern with their character code? That's the only possibility I see. But, if you're going into things like Chinese and not just romance languages, that's unlikely to work. – jfriend00 Sep 29 '19 at 21:02
  • @jfriend00 Even just including Cyrillic starts to make it a major pain, adding Chinese, Korean, Japanese, etc is going to be unmaintainable. – VLAZ Sep 29 '19 at 21:04
  • @VLAZ - Yep, that's what I thought. I think I'd go back to what the real problem is and look for a different approach. – jfriend00 Sep 29 '19 at 21:04
  • Possible duplicate of [Javascript + Unicode regexes](https://stackoverflow.com/questions/280712/javascript-unicode-regexes) – Ilmari Karonen Sep 29 '19 at 21:13

1 Answer

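You could try a Unicode regex: /[^\p{L}\s]/ugi. It matches every character that is neither a Unicode letter (\p{L}) nor whitespace (\s), so replacing the matches with an empty string leaves only letters and whitespace. Note that this also strips digits and punctuation.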

console.log('This will be replaced: » ↕ ◄, This will not be replaced: æ Ç ü'.replace(/[^\p{L}\s]/ugi, ''));

Unicode property escapes were added in ES2018. Browser support is currently limited; Node.js supports them from version 10.
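If you also want to keep digits, combining accents and basic ASCII punctuation (as in the question's expected output, where the colon survives), the character class can be widened. The following is only a sketch under assumptions: the function name keepLanguageChars is hypothetical, and the exact allowlist (\p{M} for combining marks, \p{N} for numbers, plus a handful of ASCII punctuation characters) is my guess at what "related to language" should mean, so adjust it to your needs.

```javascript
// Hypothetical helper, not part of the original answer.
// Keeps: letters (\p{L}), combining marks (\p{M}), numbers (\p{N}),
// whitespace (\s) and a small, assumed set of ASCII punctuation.
// Everything else (arrows, blocks, emoji, ...) is removed.
function keepLanguageChars(input) {
  // The "u" flag is required for \p{...} and makes the regex work on
  // whole code points, so astral characters such as emoji are matched
  // as one unit rather than as two surrogate halves.
  return input.replace(/[^\p{L}\p{M}\p{N}\s.,:;!?'"()-]/gu, '');
}

console.log(keepLanguageChars('This will not be replaced: æ Ç ü'));
// → "This will not be replaced: æ Ç ü"
console.log(keepLanguageChars('This will be replaced: » ↕ ◄'));
// → "This will be replaced:   " (the three symbols are gone, the spaces remain)
```

Whole Unicode categories are used here instead of a hand-written list of accented letters, so Cyrillic, Chinese, Japanese, Korean and so on are covered without extending the pattern. Using \p{P} for punctuation would not work for the question's example, because » is itself classified as punctuation; that is why the sketch lists the allowed punctuation explicitly.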

georg
baao
  • can you explain a bit more what the regex does or maybe a link where to read more – joyBlanks Sep 29 '19 at 21:27
  • @Edude42: I'd recommend reading the MDN page linked from the answer, but it seems to still be a work in progress. The Wikipedia page on [Unicode character properties](https://en.wikipedia.org/wiki/Unicode_character_property) might be a useful supplement for some of the missing info there. – Ilmari Karonen Oct 01 '19 at 17:31