Regex to remove non-letter characters but keep accented letters

Question

I have strings in Spanish and other languages that may contain generic special characters like (), *, etc., which I need to remove. The problem is that they may also contain language-specific characters like ñ, á, ó, í, etc., and those need to remain. So I am trying to do it with a regexp the following way:

var desired = stringToReplace.replace(/[^\w\s]/gi, '');

Unfortunately, it removes all special characters, including the language-specific ones. I'm not sure how to avoid that. Could someone suggest a fix?

score 15 · Accepted Answer · edited May 23 '17 at 12:32

I would suggest using Steven Levithan's excellent XRegExp library and its Unicode plug-in.

Here's an example that strips non-Latin word characters from a string: http://jsfiddle.net/b3awZ/1/

var regex = XRegExp("[^\\s\\p{Latin}]+", "g");
var str = "¿Me puedes decir la contraseña de la Wi-Fi?";
var replaced = XRegExp.replace(str, regex, "");

See also this answer by Steven Levithan himself:

Regular expression Spanish and Arabic words

answered Oct 16 '12 at 10:32

Tim Down


It doesn't work with combining characters, try replacing `ñ` with `n\u0303` and you'll see that it strips the accent. – Dietrich Epp Oct 21 '12 at 16:30
@DietrichEpp: That's true. To handle those cases you could simply add `\\p{InCombiningDiacriticalMarks}` into the regex. See http://jsfiddle.net/timdown/5726F/ – Tim Down Oct 21 '12 at 22:15
@TimDown +1 for the usage of {Latin} – Berker Yüceer Oct 22 '12 at 10:48
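The combining-character point raised in the comments above can also be sketched without a library, assuming an engine that supports ES2018 Unicode property escapes (the `u` flag). This is not the XRegExp code from the answer; `keepLatin` and `NON_LATIN` are names made up for this illustration. Adding `\p{M}` keeps combining marks such as U+0303 next to their base letter.

```javascript
// Assumption: the engine supports ES2018 Unicode property escapes ("u" flag).
// Keep whitespace, Latin-script letters, and combining marks (\p{M}); strip the rest.
const NON_LATIN = /[^\s\p{Script=Latin}\p{M}]+/gu;

// Hypothetical helper, not part of XRegExp.
function keepLatin(text) {
  return text.replace(NON_LATIN, "");
}

const precomposed = "contrase\u00f1a"; // ñ stored as one code point
const decomposed = "contrasen\u0303a"; // n followed by a combining tilde

console.log(keepLatin("\u00bf" + precomposed + "?"));
console.log(keepLatin(decomposed) === decomposed); // the combining tilde survives
```

Another option is to call `text.normalize("NFC")` first, which turns `n\u0303` into the single code point `ñ` before the regex runs.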

score 10 · socha23 · Answer 2 · 2012-10-16T20:43:45.550

Instead of whitelisting characters you accept, you could try blacklisting illegal characters:

var desired = stringToReplace.replace(/[-'`~!@#$%^&*()_|+=?;:'",.<>\{\}\[\]\\\/]/gi, '')
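A small sketch of how this blacklist behaves on a Spanish sentence. `BLACKLIST` and `stripBlacklisted` are illustrative names, and the character list is the one from the line above with the duplicate quote and the unneeded `i` flag dropped. Anything that is not listed passes through untouched, including characters such as `¿` or control codes.

```javascript
// Same blacklist as above; only the characters listed here are removed.
const BLACKLIST = /[-'`~!@#$%^&*()_|+=?;:",.<>{}\[\]\\\/]/g;

// Hypothetical helper name, used only for this illustration.
function stripBlacklisted(text) {
  return text.replace(BLACKLIST, "");
}

// "-" and "?" are listed and get removed; "¿" is not listed, so it stays.
console.log(stripBlacklisted("¿Me puedes decir la contraseña de la Wi-Fi?"));
```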

edited Oct 16 '12 at 20:43

answered Dec 01 '11 at 12:15

socha23


You're right, minus should be at the beginning of the illegal character list. I've updated my answer. – socha23 Oct 16 '12 at 20:44
Blacklisting is not so good, because there are many unwanted characters, for example control codes, and it is difficult to get this right. You really should only do whitelisting. – nalply Oct 18 '12 at 18:49
True, but I'm not aware of any character class in JavaScript regular expressions that would contain all the special national characters. If you want to use whitelisting, then you should probably use an external library, as in Tim Down's answer. – socha23 Oct 19 '12 at 14:58
@nalply: Whitelisting is not so good, because there are many wanted characters, for example combining accents, and it is difficult to get this right. You really should only do blacklisting. – Dietrich Epp Oct 21 '12 at 16:31
@DietrichEpp, LOL. I proposed a meta-programming solution to whitelist everything. But a meta-programming solution to blacklist everything would also make sense. This is a good one. – nalply Oct 21 '12 at 17:31

nalply · Answer 3 · 2016-03-29T14:47:07.617

Note! Works only for 16-bit code points (the Basic Multilingual Plane). This answer is incomplete.

Short answer

The character class for all arabic digits and latin letters is: [0-9A-Za-z\u00c0-\u00d6\u00d8-\u00f6\u00f8-\u02af\u1d00-\u1d25\u1d62-\u1d65\u1d6b-\u1d77\u1d79-\u1d9a\u1e00-\u1eff\u2090-\u2094\u2184-\u2184\u2488-\u2490\u271d-\u271d\u2c60-\u2c7c\u2c7e-\u2c7f\ua722-\ua76f\ua771-\ua787\ua78b-\ua78c\ua7fb-\ua7ff\ufb00-\ufb06].

To get a regex you can use, prepend /^ and append +$/. This will match strings consisting of only latin letters and digits like "mérito" or "Schönheit".

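To match the characters that are neither digits nor letters so that you can remove them, write a ^ as the first character after the opening bracket [, prepend /, and append +/g. The g flag makes replace remove every match rather than only the first one.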
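A minimal sketch of both uses described above, assuming the character class is copied verbatim from this answer. `LATIN_AND_DIGITS`, `onlyLatin` and `notLatin` are illustrative names; the class is split across several lines only for readability.

```javascript
// The character class from the short answer, kept as raw text so the
// \uXXXX escapes reach the RegExp constructor unchanged.
const LATIN_AND_DIGITS = [
  String.raw`0-9A-Za-z\u00c0-\u00d6\u00d8-\u00f6\u00f8-\u02af`,
  String.raw`\u1d00-\u1d25\u1d62-\u1d65\u1d6b-\u1d77\u1d79-\u1d9a\u1e00-\u1eff`,
  String.raw`\u2090-\u2094\u2184-\u2184\u2488-\u2490\u271d-\u271d`,
  String.raw`\u2c60-\u2c7c\u2c7e-\u2c7f\ua722-\ua76f\ua771-\ua787`,
  String.raw`\ua78b-\ua78c\ua7fb-\ua7ff\ufb00-\ufb06`,
].join("");

// Matches strings that consist only of Latin letters and digits.
const onlyLatin = new RegExp("^[" + LATIN_AND_DIGITS + "]+$");
// Matches runs of everything else, for removal.
const notLatin = new RegExp("[^" + LATIN_AND_DIGITS + "]+", "g");

console.log(onlyLatin.test("mérito"));              // a Latin-only word
console.log("¿contraseña?".replace(notLatin, ""));  // punctuation removed
```

Note that the class contains no whitespace, so the removal regex also strips spaces; adding `\s` inside the negated class would keep them.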

How did I find that out? Continue reading.

Long answer: use metaprogramming!

Because Javascript does not have Unicode regexes, I wrote a Python program to iterate over the whole of Unicode and filter by Unicode name. It is difficult to get this right manually. Why not let the computer do the dirty and menial work?

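import re
import sys
import unicodedata

def unicodeNameMatch(pattern, codepoint):
  # Match the pattern against the start of the character's Unicode name (case-insensitive).
  try:
    return re.match(pattern, unicodedata.name(chr(codepoint)), re.I)
  except ValueError:
    # Code points without a name (unassigned, surrogates, controls) end up here.
    return None

def regexChr(codepoint):
  # ASCII 32..126 is emitted literally, everything else as a \uXXXX escape.
  return chr(codepoint) if 32 <= codepoint < 127 else "\\u%04x" % codepoint

names = sys.argv[1:]  # the name patterns, without the script name
prev = None

js_regex = ""
for codepoint in range(pow(2, 16)):  # 16-bit code points only
  if any(unicodeNameMatch(name, codepoint) for name in names):
    if prev is None: js_regex += regexChr(codepoint)
    prev = codepoint
  else:
    if not prev is None: js_regex += "-" + regexChr(prev)
    prev = None

print("[" + js_regex + "]")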

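Invoke it like this: python char_class.py latin digit and you get the character class mentioned above. It's an ugly char class, but you know for sure that you caught all characters whose names start with latin or digit (re.match anchors the pattern at the beginning of the name). The exact ranges depend on the Unicode version that ships with your Python.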

Browse the Unicode Character Database to view the names of all Unicode characters. The name is in uppercase after the first semicolon; for example, for A it's the line

0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;

Try python char_class.py "latin small" and you get a character class for all latin small letters.

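Edit: There is a small misfeature (aka bug) in that degenerate ranges such as \u271d-\u271d occur in the regex. They are harmless, but to avoid them the script has to remember where each run started. Replace

if prev is None: js_regex += regexChr(codepoint)

by

if prev is None:
  js_regex += regexChr(codepoint)
  start = codepoint

and replace

if not prev is None: js_regex += "-" + regexChr(prev)

by

if not prev is None and prev != start: js_regex += "-" + regexChr(prev)

Comparing prev with codepoint does not help here, because in the else branch codepoint is always past prev.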
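The range-collapsing step can also be sketched in JavaScript. This version records where each run starts, so a single matching character is emitted once instead of as a degenerate range like `\u271d-\u271d`. It is an illustration, not a port of the Python script: `toCharClass` is a made-up name, and the `matches` predicate is an assumption that stands in for the Unicode-name lookup, which JavaScript does not provide.

```javascript
// Printable ASCII is emitted literally, everything else as a \uXXXX escape.
// Caveat: characters that are special inside a class (], \, ^, -) are not escaped.
function regexChr(cp) {
  return cp >= 32 && cp < 127
    ? String.fromCharCode(cp)
    : "\\u" + cp.toString(16).padStart(4, "0");
}

// Hypothetical helper: `matches` is any predicate over 16-bit code points.
function toCharClass(matches) {
  let out = "";
  let start = null; // first code point of the current run, or null
  // Go one step past 0xFFFF so a run that reaches the end is still flushed.
  for (let cp = 0; cp <= 0x10000; cp++) {
    const hit = cp < 0x10000 && matches(cp);
    if (hit && start === null) {
      start = cp;
    } else if (!hit && start !== null) {
      const end = cp - 1;
      out += regexChr(start);
      if (end > start) out += "-" + regexChr(end); // skip "-x" for single characters
      start = null;
    }
  }
  return "[" + out + "]";
}

// a, b, c form a range; x stands alone.
console.log(toCharClass((cp) => (cp >= 97 && cp <= 99) || cp === 120));
```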

Thanks for this, but it seems some characters are missing. /^[0-9A-Za-z\u00c0-\u00d6\u00d8-\u00f6\u00f8-\u02af\u1d00-\u1d25\u1d62-\u1d65\u1d6b-\u1d77\u1d79-\u1d9a\u1e00-\u1eff\u2090-\u2094\u2184-\u2184\u2488-\u2490\u271d-\u271d\u2c60-\u2c7c\u2c7e-\u2c7f\ua722-\ua76f\ua771-\ua787\ua78b-\ua78c\ua7fb-\ua7ff\ufb00-\ufb06]+$/.test("ستفتاء.مصر") returns false. — Erin Ishimoticha, Apr 24 '14 at 14:06
It's a character class for LATIN letters (and arabic digits). No wonder arabic text won't match. — nalply, Apr 24 '14 at 17:39
My fault! I missed that. Not that I would recognize Arabic digits vs. non-digits, anyway. :( — Erin Ishimoticha, Apr 26 '14 at 01:40
Try `python char_class.py "arabic letter"` to get a character class for arabic letters only. If this doesn't cover your needs, have a look at the Unicode Character Database (see link above), for example `0620;ARABIC LETTER KASHMIRI YEH;Lo;0;AL;;;;;N;;;;;`. The python character class generator looks at the second field, for example `ARABIC LETTER KASHMIRI YEH` and if the parameter matches the field it's included. — nalply, Apr 26 '14 at 08:14

score 2 · Answer 4 · edited May 23 '17 at 11:46

var desired = stringToReplace.replace(/[\u0000-\u007F][\W]/gi, '');

might do the trick.

See also this Javascript + Unicode regexes question.

answered Dec 01 '11 at 11:54

Density 21.5


This regex does not make a lot of sense. What did you want to express? – Tomalak Dec 01 '11 at 12:11
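To see what the comment is getting at, here is a small sketch of what the pattern from this answer actually matches: one ASCII character followed by one character that is not in `[A-Za-z0-9_]`. Since `\W` also covers accented letters, the pair is removed together. `applyAttempt` is a name made up for this illustration.

```javascript
// The pattern from the answer above, unchanged.
const ATTEMPT = /[\u0000-\u007F][\W]/gi;

// Hypothetical helper name, used only for this illustration.
function applyAttempt(text) {
  return text.replace(ATTEMPT, "");
}

console.log(applyAttempt("año"));   // "a" + "ñ" match as a pair and are both removed
console.log(applyAttempt("Wi-Fi")); // "i" + "-" match as a pair and are both removed
```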

score 1 · Answer 5 · edited May 23 '17 at 12:02

If you must insist on whitelisting here is the rawest way of doing it:

Test if string contains only letters (a-z + é ü ö ê å ø etc..)

It works by keeping track of 'all' unicode letter chars.

answered Oct 21 '12 at 16:26

Capstone


Martin Ender · Answer 6 · 2012-10-17T16:24:22.173

Unfortunately, Javascript does not support Unicode character properties (which would be just the right regex feature for you). If changing the language is an option for you, PHP (for example) can do this:

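preg_replace("/[^\pL0-9_\s]/u", "", $str);

Where \pL matches any Unicode character that represents a letter (lower case, upper case, modified or unmodified). The u modifier makes PCRE treat the pattern and the subject as UTF-8; without it, multi-byte letters such as ñ are processed byte by byte.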
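Newer JavaScript engines (ES2018 and later) do accept Unicode property escapes when the `u` flag is set, so the same pattern can be sketched in JavaScript, assuming such an engine is available. `stripNonLetters` and `NOT_LETTER_DIGIT_SPACE` are illustrative names.

```javascript
// Assumption: the engine supports ES2018 Unicode property escapes ("u" flag).
// \p{L} is the JavaScript spelling of PCRE's \pL (any Unicode letter).
const NOT_LETTER_DIGIT_SPACE = /[^\p{L}0-9_\s]/gu;

// Hypothetical helper name, used only for this illustration.
function stripNonLetters(text) {
  return text.replace(NOT_LETTER_DIGIT_SPACE, "");
}

console.log(stripNonLetters("¿Me puedes decir la contraseña de la Wi-Fi?"));
console.log(stripNonLetters("ستفتاء.مصر")); // letters of any script are kept
```

On engines without this feature the regex literal is a syntax error, so the library or the manual whitelist remains the fallback there.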

If you have to stick with JavaScript and cannot use the library suggested by Tim Down, the only options are probably blacklisting or whitelisting. Your bounty mentions that blacklisting is not actually an option in your case, so you will probably have to include the special characters of the relevant language manually:

var desired = stringToReplace.replace(/[^\w\sñáóí]/gi, '');

Or use their corresponding Unicode sequences:

var desired = stringToReplace.replace(/[^\w\s\u00F1\u00E1\u00F3\u00ED]/gi, '');

Then simply add all the ones you want to take care of. Note that the case-insensitive modifier also works with Unicode sequences.
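A sketch of the manual whitelist extended to a fuller set of Spanish letters. The list in `SPANISH_EXTRAS` is an assumption made for this example and should be adapted to the languages you actually need; `stripNonSpanish` is an illustrative name. Thanks to the `i` flag, listing only the lowercase forms also keeps `Ñ`, `Á`, and so on.

```javascript
// Assumed set of extra letters: á é í ó ú ü ñ (written as escapes to keep
// them precomposed regardless of how this file is saved).
const SPANISH_EXTRAS = "\u00e1\u00e9\u00ed\u00f3\u00fa\u00fc\u00f1";

// Everything except word characters, whitespace, and the extras is removed.
const NOT_ALLOWED = new RegExp("[^\\w\\s" + SPANISH_EXTRAS + "]", "gi");

// Hypothetical helper name, used only for this illustration.
function stripNonSpanish(text) {
  return text.replace(NOT_ALLOWED, "");
}

console.log(stripNonSpanish("¡Ñandú, 100%!")); // uppercase Ñ is kept via the i flag
```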
