JavaScript - regex to remove special characters but also keep Greek characters

Question

I am trying to remove special characters from a piece of text, but using the following regular expression

var desired = stringToReplace.replace(/[^\w\s]/gi, '')

(found here: javascript regexp remove all special characters)

has the unwanted side effect of also deleting Greek characters.

Can someone also explain how to use character ranges in regular expressions? Is there a character map that can help me define the range I want?

Answer:

[a-zA-Z0-9ΆΈ-ώ\s]   # See my 2nd comment under Joeytje50's answer.

You need to define what you mean by “greek characters”. Do you mean letters and punctuation marks used in modern Greek, or any characters that belong to the Greek script (Greek writing system)? — Jukka K. Korpela, Apr 27 '14 at 19:19

Joeytje50 · Accepted Answer · 2014-04-27T19:23:56.040

The way these ranges are defined is based on their character code. So, since A has char code 65, and z has char code 122, the following regex:

[A-z]

would match every letter, but also every character whose code falls between those two, namely codes 91 through 96, which are the characters [\]^_` (demo).
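A quick way to check which non-letters slip into `[A-z]` is to walk the codes from 65 to 122 and compare against `[A-Za-z]`. This is only an illustrative sketch; the variable name `extras` is made up for the example.

```javascript
// Collect every character that [A-z] matches but [A-Za-z] does not.
const extras = [];
for (let code = 65; code <= 122; code++) {
  const ch = String.fromCharCode(code);
  if (/[A-z]/.test(ch) && !/[A-Za-z]/.test(ch)) {
    extras.push(ch);
  }
}
// Prints the punctuation that sits between "Z" and "a" in the code table.
console.log(extras.join(''));
```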

Now, for Greek letters, the character codes for the uppercase characters are 913-937 for alpha through omega, and the lowercase characters are 945-969 for alpha through omega (this includes both lowercase variants of sigma, namely ς (962) and σ (963)).
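If you want to verify these codes yourself rather than look them up in a table, `codePointAt` gives the number for any character. A small sketch (the array names are just for illustration):

```javascript
// First and last uppercase letter, first and last lowercase letter,
// plus the two lowercase sigma variants.
const letters = ['Α', 'Ω', 'α', 'ς', 'σ', 'ω'];
const codes = letters.map((ch) => ch.codePointAt(0));
console.log(codes); // [913, 937, 945, 962, 963, 969]
```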

So, to match Latin letters, Greek letters, and Arabic numerals, you need the following character class (negate it with a leading ^, as in [^a-zA-Z0-9α-ωΑ-Ω], to match every character except those):

[a-zA-Z0-9α-ωΑ-Ω]

So, for Greek characters, ranges work just like they do for Latin letters.

Edit: I've tested this via a Google Translate'd Lipsum, and it looks like this doesn't take accented letters into account. I've checked the character codes for these accented letters, and it turns out they are placed right before and right after the lowercase letters. So, the following regex also covers the accented lowercase letters:

[a-zA-Z0-9ά-ωΑ-ώ]

Demo

This expanded range now also includes άέήίΰ (char codes 940 through 944) and ϊϋόύώ (codes 970 through 974).
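To see the difference between the two classes in practice, here is a sketch that uses their negated forms with `replace()`, so that anything outside the class is stripped. The sample word and the variable names are my own choices for the example.

```javascript
// Negated forms of the two classes, so replace() strips whatever falls outside them.
const plainOnly = /[^a-zA-Z0-9α-ωΑ-Ω]/g;
const withAccents = /[^a-zA-Z0-9ά-ωΑ-ώ]/g;

const word = 'παράδειγμα'; // sample word containing the accented letter ά (code 940)

const strippedPlain = word.replace(plainOnly, '');
const strippedAccents = word.replace(withAccents, '');
console.log(strippedPlain);   // "παρδειγμα" (the ά is lost)
console.log(strippedAccents); // "παράδειγμα"
```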

To also include whitespace (spaces, tabs, newlines), simply add \s to the character class:

[a-zA-Z0-9ά-ωΑ-ώ\s]

Demo.

Edit: Apparently there are more Greek letters that needed to be included in this range, namely those in the range [Ά-Ϋ], which is the range of letters right before the ά, so the new regex would look like this:

[a-zA-Z0-9Ά-ωΑ-ώ\s]

Demo.

Thank you! Very clear answer. But, is there an easy way to keep spaces and line feeds? (Because [^a-zA-Z0-9α-ωΑ-Ω] removes them) — tgogos, Apr 27 '14 at 19:03
I've added another demo to also match whitespace to my answer. — Joeytje50, Apr 27 '14 at 19:06
The pattern does not accept capital Greek letters with tonos, like Έ. — Jukka K. Korpela, Apr 27 '14 at 19:16
@JukkaK.Korpela it looks like to make that work, you'd need to include the range `[Ά-Ϋ]` too, so it'd be `[a-zA-Z0-9Ά-ωΑ-ώ\s]` then (since that new range comes right before the `[ά-ω]` range). This also, for some reason, includes a second range of uppercase greek letters. — Joeytje50, Apr 27 '14 at 19:21
Joeytje50, @JukkaK.Korpela [a-zA-Z0-9Ά-ωΑ-ώ\s] is exactly what I was looking for. [Unicode-table.com](http://unicode-table.com) also helped me a lot and I actually did a small change to exclude · (greek ano teleia) which is between Ά and Έ. The final regex is: [a-zA-Z0-9ΆΈ-ώ\s] — tgogos, Apr 28 '14 at 11:11
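For reference, this is roughly how the final class from the comment above can be used to answer the original question. It is a sketch, not a definitive implementation: the function name `stripSpecialKeepGreek` is hypothetical, and the class is negated with `^` so that `replace()` removes everything outside it.

```javascript
// Hypothetical helper: keeps Latin letters, digits, whitespace, Ά, and Έ through ώ.
// The ano teleia (·) sits between Ά and Έ, so it is removed like other punctuation.
function stripSpecialKeepGreek(text) {
  return text.replace(/[^a-zA-Z0-9ΆΈ-ώ\s]/g, '');
}

const cleanedSentence = stripSpecialKeepGreek('Έλα, Άννα· τι ώρα είναι; #2014!');
console.log(cleanedSentence); // "Έλα Άννα τι ώρα είναι 2014"
```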

score 2 · Answer 2 · answered Apr 27 '14 at 18:44


Try adding the range of Greek characters like this:

/[^\w\sΆΈ-ϗἀ-῾]/gi

I created this pattern by looking at the Unicode blocks 0370 (Greek and Coptic) and 1F00 (Greek Extended). I don't speak Greek, and it's likely that a more restricted character set would be more appropriate, but this seems to work:

"-ἄλφα-".replace(/[^\w\sΆΈ-ϗἀ-῾]/gi, ''); // "ἄλφα"

p.s.w.g

This approach seems to have the same result with [^a-zA-Z0-9α-ωΑ-Ω] but confused me a little bit because I can't understand what I am doing. Thanks for the character sets, I will take a look. – tgogos Apr 27 '14 at 19:06
@antithesis I used the extended characters to capture letters with accents, e.g. the `ἄ` in `ἄλφα`. `α-ωΑ-Ω` wouldn't handle accents, e.g. `"ἄλφα".replace(/[^\w\sα-ωΑ-Ω]/gi, '')` → `"λφα"`. Both solutions are fine, but which pattern you choose really depends on *exactly* what set of characters you need to exclude from your match. – p.s.w.g Apr 27 '14 at 19:09

Pedro Lobito · Answer 3 · 2014-04-27T18:53:42.857

0

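var stringToReplace = "παράδειγμαs & /(";
// Remove everything except the Greek and Coptic block (U+0370-U+03FF), word characters, and whitespace.
var result = stringToReplace.replace(/[^\u0370-\u03FF\w\s]/g, "");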

DEMO:

http://jsfiddle.net/tuga/LKjYd/

0370-03FF Greek and Coptic Character Block

http://apps.timwhitlock.info/js/regex
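Keep in mind that the whole block is wider than the alphabet. As a rough illustration (the sample string is made up), the Greek question mark U+037E and the ano teleia U+0387 both live inside 0370-03FF, so a pattern built on the full block leaves them in place:

```javascript
const blockPattern = /[^\u0370-\u03FF\w\s]/g;

// U+037E and U+0387 are punctuation, written as escapes here so they are easy to spot.
const sample = 'τι\u037E ναι\u0387 όχι!';
const kept = sample.replace(blockPattern, '');

// Only the ASCII "!" is removed; both Greek punctuation marks survive.
console.log(kept === 'τι\u037E ναι\u0387 όχι'); // true
```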

edited Apr 27 '14 at 18:53

answered Apr 27 '14 at 18:46


Please explain what `{InGreek and Coptic}` means. – Joeytje50 Apr 27 '14 at 18:48
I've replaced it with \u0370-\u03FF , it seems that javascript doesn't support `{InGreek and Coptic}` – Pedro Lobito Apr 27 '14 at 18:54
The pattern accepts many characters that are not letters, as well as letters that are not used in modern Greek. – Jukka K. Korpela Apr 27 '14 at 19:18
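A side note on the `{InGreek and Coptic}` exchange above: JavaScript engines that implement ES2018 Unicode property escapes accept `\p{Script=Greek}` when the regex has the `u` flag. The following is a minimal sketch that assumes such an engine (a recent Node.js or a current browser). It is not what any of the answers here used. The script property also covers some Greek signs that are not letters, so it is still broader than "letters used in modern Greek".

```javascript
// Requires the u flag; \p{Script=Greek} is an ES2018 Unicode property escape.
const notGreekOrAscii = /[^\p{Script=Greek}a-zA-Z0-9\s]/gu;

const cleaned = 'ἄλφα & βήτα 123!'.replace(notGreekOrAscii, '');
console.log(cleaned); // "ἄλφα  βήτα 123" (the two spaces around the removed & remain)
```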
