50

Right now my regex is something like this:

[a-zA-Z0-9]

but it does not include accented characters, which I want it to match. I would also like - ' , to be included.

Exn

5 Answers

41

Accented Characters: DIY Character Range Subtraction

If your regex engine allows it (and many will), this will work:

(?i)^(?:(?![×Þß÷þø])[-'0-9a-zÀ-ÿ])+$

Please see the demo (you can add characters to test).

Explanation

  • (?i) sets case-insensitive mode
  • The ^ anchor asserts that we are at the beginning of the string
  • (?:(?![×Þß÷þø])[-'0-9a-zÀ-ÿ]) matches one character...
  • The lookahead (?![×Þß÷þø]) asserts that the char is not one of those in the brackets
  • [-'0-9a-zÀ-ÿ] allows dash, apostrophe, digits, letters, and chars in a wide accented range (À to ÿ), from which the lookahead subtracts the characters we don't want
  • The + matches that one or more times
  • The $ anchor asserts that we are at the end of the string
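
As a rough illustration, here is how the pattern above might be tried in JavaScript. JavaScript regex literals have traditionally not accepted the inline (?i) modifier, so this sketch moves it to the i flag. The helper name isPlainName and the sample strings are made up for this example.

```javascript
// Same pattern as above, with the inline (?i) replaced by the `i` flag.
const pattern = /^(?:(?![×Þß÷þø])[-'0-9a-zÀ-ÿ])+$/i;

// Hypothetical helper name, used only for this illustration.
function isPlainName(input) {
  return pattern.test(input);
}

console.log(isPlainName("José"));       // é lies inside À-ÿ
console.log(isPlainName("d'Artagnan")); // apostrophe is allowed
console.log(isPlainName("5×3"));        // × is subtracted by the lookahead
console.log(isPlainName("Straße"));     // ß is subtracted by the lookahead
```

The first two calls should print true and the last two false, since × and ß are in the excluded set.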

Reference

Extended ASCII Table

zx81
27

Put this in your expression:

\p{L}\p{M}

In a Unicode-aware regex engine, this will match:

  • any letter character (L) from any language
  • and marks (M), i.e. characters that are meant to be combined with another character (accents, etc.)
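
A minimal sketch of how these properties could be used for whole-string validation in JavaScript, assuming an engine that supports Unicode property escapes (they require the u flag). Putting both properties in one character class, adding the hyphen and apostrophe, and the name isWord are choices made for this example, not part of the answer itself.

```javascript
// Letters (\p{L}) and combining marks (\p{M}), plus apostrophe and hyphen.
// The `u` flag is required for \p{...} to be recognized in JavaScript.
const wordPattern = /^[\p{L}\p{M}'-]+$/u;

// Hypothetical helper name for this illustration.
function isWord(input) {
  return wordPattern.test(input);
}

console.log(isWord("Sójkowska"));   // precomposed accented letter
console.log(isWord("n\u0308"));     // "n" followed by a combining diaeresis (a mark)
console.log(isWord("déjà-vu"));     // hyphen is allowed
console.log(isWord("hello world")); // space is not in the class
```

The first three calls should print true and the last false. The second case is where \p{M} earns its place: the combining diaeresis is a mark, not a letter, so a class with only \p{L} would reject it.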
NightCoder
  • You were missing the /u - the complete regex is `/\p{L}+|\p{M}+/ugm` https://regex101.com/r/H59XSX/1 – mplungjan Jul 13 '21 at 06:23
  • This addresses a different problem – NightCoder Jul 18 '21 at 22:05
  • Thanks, only this nice solution works for the Eastern European string I have (Sójkowska) – bcag2 Jul 26 '21 at 07:59
  • Had better mileage with `/[\p{L}\p{M}\d'-]+/ugm`, which includes the hyphen (as requested), an apostrophe (which usually makes sense, too) and numbers. If you don't want numbers (which can carry meaning just like a word), just leave the trailing `\d` out. https://regex101.com/r/O4WLyd/1 – Jan Jan 02 '22 at 10:34
7

A version without the exclusion rules:

^[-'a-zA-ZÀ-ÖØ-öø-ÿ]+$

Explanation

  • The ^ anchor asserts that we are at the beginning of the string
  • [...] allows dash, apostrophe, letters, and chars in a wide accented range; the ranges À-Ö, Ø-ö, and ø-ÿ skip × and ÷ (note that this pattern does not allow digits)
  • The + matches that one or more times
  • The $ anchor asserts that we are at the end of the string
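
To see what the explicit ranges do and do not cover, the pattern can be tried in JavaScript roughly as follows. The sample strings and the helper name matchesRanges are made up for illustration.

```javascript
// The ranges À-Ö, Ø-ö and ø-ÿ skip × (U+00D7) and ÷ (U+00F7),
// so no lookahead is needed to exclude them.
const rangePattern = /^[-'a-zA-ZÀ-ÖØ-öø-ÿ]+$/;

// Hypothetical helper name for this illustration.
function matchesRanges(input) {
  return rangePattern.test(input);
}

console.log(matchesRanges("brød"));   // ø is inside ø-ÿ
console.log(matchesRanges("a×b"));    // × falls in the gap between Ö and Ø
console.log(matchesRanges("\u0101")); // ā (U+0101) lies beyond ÿ
console.log(matchesRanges("abc123")); // digits are not in this class
```

Only the first call should print true; the others show the gaps left by a fixed range.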

just.jules
  • Note that this misses numerous accented characters, including ӑ, ā, ć, n̈, and ō. It also includes characters the OP may not necessarily want, such as æ, Æ, Þ, þ, ß, and ø. See https://regex101.com/r/gY7rO4/263 – Andrew Faulkner Sep 14 '18 at 03:08
  • It does, however, cover the requirement of "good" coverage of the most common accented characters and is easily modifiable to any reader's requirements. æ, ć, n̈, ō, ß, and ø are in my requirement set. Great testing tool! – just.jules Sep 15 '18 at 06:24
  • Version 264 of the regex linked above has more matches and is a little more elegant, using case-insensitive matching: `/(?![×Þß÷þ])[a-zá-žàấӑệởș]/ui` https://regex101.com/r/gY7rO4/264 though I'm unsure why the ø is removed (brød = bread in Norwegian/Danish) – Richard Herries Sep 11 '20 at 08:47
4

Use a POSIX character class (http://www.regular-expressions.info/posixbrackets.html):

[-'[:alpha:]0-9] or [-'[:alnum:]]

The [:alpha:] character class matches whatever is considered "alphabetic characters" in your locale.

Brian Stephens
4

@NightCoder's answer works perfectly in PHP:

    \p{L}\p{M}

and with no brittle whitelists. Note that to get it working in JavaScript you need to add the Unicode u flag. It is useful to have a working example in JavaScript:

const text = `Crêpes are øh-so déclassée`;
// The semicolon above matters: without it, the [ on the next line is parsed as part of the same expression.
const matches = [...text.matchAll(/[-'’\p{L}\p{M}\p{N}]+/giu)];

Here, matches will contain something like:

[
    {
        "0": "Crêpes",
        "index": 0
    },
    {
        "0": "are",
        "index": 7
    },
    {
        "0": "øh-so",
        "index": 11
    },
    {
        "0": "déclassée",
        "index": 17
    }
]

Here it is in a playground... https://regex101.com/r/ifgH4H/1/

And also some detail on those regex unicode categories... https://javascript.info/regexp-unicode

chichilatte