What's a good regex to include accented characters in a simple way?

Question

Right now my regex is something like this:

[a-zA-Z0-9], but it does not match accented characters, which I would like it to. I would also like the characters - ' and , to be included.

@AvinashRaj I guess it's one of these: http://en.wikipedia.org/wiki/Ä – Darek Nędza, Jul 10 '14 at 12:50
You can also take a look at this SO question: https://stackoverflow.com/questions/20690499/concrete-javascript-regular-expression-for-accented-characters-diacritics – Joand, Feb 07 '22 at 09:01

zx81 · Accepted Answer · 2014-07-10T12:44:02.403

41

Accented Characters: DIY Character Range Subtraction

If your regex engine allows it (and many will), this will work:

(?i)^(?:(?![×Þß÷þø])[-'0-9a-zÀ-ÿ])+$

Please see the demo (you can add characters to test).

Explanation

(?i) sets case-insensitive mode
The ^ anchor asserts that we are at the beginning of the string
(?:(?![×Þß÷þø])[-'0-9a-zÀ-ÿ]) matches one character...
The lookahead (?![×Þß÷þø]) asserts that the char is not one of those in the brackets
[-'0-9a-zÀ-ÿ] allows dash, apostrophe, digits, letters, and chars in a wide accented range, from which we need to subtract
The + matches that one or more times
The $ anchor asserts that we are at the end of the string
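As a rough illustration of the pattern above, here is a minimal JavaScript sketch. It assumes a JavaScript engine, where a leading inline (?i) is not available, so case-insensitivity is expressed with the i flag instead. The name latin1Name and the sample strings are made up for this example.

```javascript
// Sketch only: the answer's pattern with (?i) replaced by the /i flag.
// "latin1Name" is a hypothetical name, not part of the original answer.
const latin1Name = /^(?:(?![×Þß÷þø])[-'0-9a-zÀ-ÿ])+$/i;

console.log(latin1Name.test("Zoë"));            // true: ë lies inside À-ÿ
console.log(latin1Name.test("O'Brien-Müller")); // true: apostrophe and dash are allowed
console.log(latin1Name.test("Strauß"));         // false: ß is subtracted by the lookahead
console.log(latin1Name.test("Šárka"));          // false: Š (U+0160) lies outside À-ÿ
console.log(latin1Name.test("a×b"));            // false: × is subtracted by the lookahead
```

The last two results show both limits of the approach: characters outside the Latin-1 range are not covered, and the lookahead list decides which Latin-1 characters are rejected.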

Reference

Extended ASCII Table

edited Jul 10 '14 at 12:44

answered Jul 10 '14 at 12:38

zx81

Should it not be À-Ž? – Scott Flack Mar 09 '18 at 02:49
Not sure what language you are targeting, but you would also need to add œ and Œ for some (French among others), which falls outside the range described here... – Will59 Dec 10 '19 at 15:27
Doesn't match `Šš`. – Gajus Feb 19 '20 at 09:39
The demo is outdated and the regex from the answer should be copied to see it work. – CharlesG May 06 '20 at 17:01
Don't forget the period and comma for "Bob, Jr." – thdoan Jun 23 '21 at 01:09
Doesn't match ğ. And why do you exclude Þßþø? Icelandic, German and Danish use these. – mplungjan Jul 13 '21 at 06:21
Czech uses `Č` which isn't included in this regex. Extend it with `Ā-ſ` to cover `Latin Extended-A` which includes `Č`. If you also want to include `Latin Extended-B` use `ƀ-ȳ`. Then `[-'0-9a-zÀ-ÿ]` becomes `[-'0-9a-zÀ-ÿĀ-ſƀ-ȳ]` – Niklas Nov 06 '21 at 10:17

NightCoder · Answer 2 · 2022-03-13T06:24:07.597

27

Put these in your expression, typically inside a character class:

\p{L}\p{M}

In a Unicode-aware engine this will match:

any letter character (L) from any language
and marks (M), i.e. characters that are meant to be combined with another character (accents, etc.)
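A small sketch of how these two categories might be used for whole-string validation in JavaScript. It assumes an engine with Unicode property escapes (ES2018 or later), where the u flag is required. The name unicodeName and the choice to also allow dash and apostrophe are assumptions made for this example.

```javascript
// Sketch only: "unicodeName" is a hypothetical name.
// \p{L} = any letter, \p{M} = any combining mark; both need the u flag.
const unicodeName = /^[-'\p{L}\p{M}]+$/u;

console.log(unicodeName.test("Sójkowska")); // true
console.log(unicodeName.test("Šárka"));     // true: no hand-written ranges needed
console.log(unicodeName.test("n\u0308"));   // true: n followed by a combining diaeresis (\p{M})
console.log(unicodeName.test("abc123"));    // false: digits are not in L or M
```

Adding \p{N} or \d to the class would admit digits as well, if that is wanted.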

edited Mar 13 '22 at 06:24

answered Nov 28 '20 at 18:16

NightCoder

You were missing the /u - the complete regex is `/\p{L}+|\p{M}+/ugm` https://regex101.com/r/H59XSX/1 – mplungjan Jul 13 '21 at 06:23
this addresses a different problem – NightCoder Jul 18 '21 at 22:05
thanks, only this nice solution works for East Europe string I have (Sójkowska) – bcag2 Jul 26 '21 at 07:59
Had better mileage with `/[\p{L}\p{M}\d'-]+/ugm`, which includes the hyphen (as requested), an apostrophe (which usually makes sense, too) and numbers. If you don't want numbers (which can carry meaning just like a word), just leave out the `\d`. https://regex101.com/r/O4WLyd/1 – Jan Jan 02 '22 at 10:34

score 7 · Answer 3 · answered Feb 22 '18 at 11:01

A version without the exclusion rules:

^[-'a-zA-ZÀ-ÖØ-öø-ÿ]+$

Explanation

The ^ anchor asserts that we are at the beginning of the string
[...] allows dash, apostrophe, letters, and chars in a wide accented range (À-ÿ with × and ÷ left out); note that, as written, it does not allow digits, so add 0-9 if you need them
The + matches that one or more times
The $ anchor asserts that we are at the end of the string
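To see what the three accented sub-ranges leave out, the following JavaScript sketch walks the code points U+00C0 to U+00FF and collects every character the class rejects. The names noExclusions and rejected are made up for this example.

```javascript
// Sketch only: "noExclusions" and "rejected" are hypothetical names.
const noExclusions = /^[-'a-zA-ZÀ-ÖØ-öø-ÿ]+$/;

// Test each single character from À (U+00C0) to ÿ (U+00FF).
const rejected = [];
for (let cp = 0xC0; cp <= 0xFF; cp++) {
  const ch = String.fromCodePoint(cp);
  if (!noExclusions.test(ch)) rejected.push(ch);
}

console.log(rejected); // [ '×', '÷' ]: only U+00D7 and U+00F7 fall in the gaps
```

Unlike the lookahead version, this class still accepts Þ, ß, þ and ø, since only the two non-letter symbols sit between the sub-ranges.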

Reference

Extended ASCII Table

answered Feb 22 '18 at 11:01

just.jules

Note that this misses numerous accented characters, including ӑ, ā, ć, n̈, and ō. It also includes characters the OP may not necessarily want, such as æ, Æ, Þ, þ, ß, and ø. See https://regex101.com/r/gY7rO4/263 – Andrew Faulkner Sep 14 '18 at 03:08
It does, however, meet the requirement of "good" coverage of the most common accented characters, and it is easy to modify to suit any reader's requirements. æ, ć, n̈, ō, ß, and ø are in my requirement set. Great testing tool! – just.jules Sep 15 '18 at 06:24
Version 264 of the regex linked above has more matches and is a little more elegant, using case-insensitive matching: `/(?![×Þß÷þ])[a-zá-žàấӑệởș]/ui` https://regex101.com/r/gY7rO4/264 although I am unsure why the ø is removed (brød = bread in Norwegian/Danish) – Richard Herries Sep 11 '20 at 08:47

score 4 · Answer 4 · answered Jul 10 '14 at 12:41

Use a POSIX character class (http://www.regular-expressions.info/posixbrackets.html):

[-'[:alpha:]0-9] or [-'[:alnum:]]

The [:alpha:] character class matches whatever is considered "alphabetic characters" in your locale.

answered Jul 10 '14 at 12:41

Brian Stephens

He wants accented characters. In many engines this will not match Ô, à, é etc. – zx81 Jul 10 '14 at 12:43

chichilatte · Answer 5 · 2023-08-10T12:53:50.133

@NightCoder's answer works perfectly in PHP:

    \p{L}\p{M}

and with no brittle whitelists. Note that to get it working in JavaScript you need to add the Unicode u flag. Here is a working example in JavaScript:

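const text = `Crêpes are øh-so déclassée`;
// The semicolon above matters: without it, the [ on the next line is parsed as indexing into the template literal.
[ ...text.matchAll(  /[-'’\p{L}\p{M}\p{N}]+/giu  ) ]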

will return something like...

[
    {
        "0": "Crêpes",
        "index": 0
    },
    {
        "0": "are",
        "index": 7
    },
    {
        "0": "øh-so",
        "index": 11
    },
    {
        "0": "déclassée",
        "index": 17
    }
]

Here it is in a playground... https://regex101.com/r/ifgH4H/1/

And also some detail on those regex unicode categories... https://javascript.info/regexp-unicode
