How to match accented characters with a regex?

Question

I am running Ruby on Rails 3.0.10 and Ruby 1.9.2. I am using the following regex to match names:

NAME_REGEX = /^[\w\s'"\-_&@!?()\[\]-]*$/u

validates :name,
  :presence   => true,
  :format     => {
    :with     => NAME_REGEX,
    :message  => "format is invalid"
  }

However, if I try to save words like the following:

Oilalà
Pì
Rùby
...

# In short, any word with accented characters

I get a validation error: "Name format is invalid".

How can I change the above regex so that it also matches accented characters like à, è, é, ì, ò, ù, ...?

Strange: if you do it from the command line, it works: `irb(main):019:0> "làasdasd".scan /^[\w\s'"\-_&@!?()\[\]-]*$/u => ["l\303\240asdasd"]`; and doesn't work if you omit the unicode modifier. — seb, Sep 03 '11 at 10:25

Lars Haugseth · Accepted Answer · 2011-09-03T11:47:13.757

60

Instead of \w, use the POSIX bracket expression [:alpha:]:

"blåbær dèjá vu".scan /[[:alpha:]]+/  # => ["blåbær", "dèjá", "vu"]

"blåbær dèjá vu".scan /\w+/  # => ["bl", "b", "r", "d", "j", "vu"]

In your particular case, change the regex to this:

NAME_REGEX = /^[[:alpha:]\s'"\-_&@!?()\[\]-]*$/u

This matches much more than just accented characters, though, which is a good thing. Make sure you read this blog entry about common misconceptions regarding names in software applications.
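As a rough sketch of the change above, the snippet below checks a few names against the `[[:alpha:]]` version of the pattern. The method names `name_pattern` and `valid_name?` are hypothetical, and the anchors are written as `\A` and `\z` (an assumption, not part of the answer) so that the whole string has to match rather than a single line of it.

```ruby
# Hypothetical helper returning the answer's character class,
# anchored with \A and \z instead of ^ and $ (assumption).
def name_pattern
  /\A[[:alpha:]\s'"\-_&@!?()\[\]]*\z/
end

# Hypothetical stand-in for the Rails format validation.
def valid_name?(name)
  !!(name =~ name_pattern)
end

puts valid_name?("Oilalà")        # prints true
puts valid_name?("Pì")            # prints true
puts valid_name?("O'Neil & Søn")  # prints true
puts valid_name?("R2-D2")         # prints false: [[:alpha:]] has no digits
```

Note that `\w` also matched digits, so a name containing digits passes the original pattern but not this one; `[[:alnum:]]` would be the closer replacement if digits should stay allowed.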

edited Sep 03 '11 at 11:47

answered Sep 03 '11 at 11:20

Lars Haugseth


Can you show me how to exactly use the '[:alpha:]' in my regex? – user502052 Sep 03 '11 at 11:23

Just replace `\w` with `[:alpha:]` in your regular expression. – Benoit Garret Sep 03 '11 at 11:27

Actually, **\w** should be replaced with **[:alnum:]**. And if you want to match something that isn't an alphanumeric character, just replace **[^\w]** with **[[:^alnum:]]**. – Guilherme Garnier Mar 23 '12 at 17:29

I could not get `[:alpha:]` to work with Ruby 2.2.2, but was able to get `\p{Alpha}` to work. See the Ruby `Regexp` Character Properties http://ruby-doc.org/core-2.1.1/Regexp.html#class-Regexp-label-Character+Properties. – Powers Sep 24 '15 at 21:27
Why doesn't \w match though? Seems like bad implementation of regex to me. – Nyerguds Apr 27 '16 at 09:24
Use `[:word:]` POSIX to get the exact same behaviour as `\w`, it includes digits and underscore. – Dinatih Feb 05 '17 at 17:35
This does not work for Slovak accents, e.g. VAŽECKÁ: the accents are not matched. – user7082181 Apr 03 '19 at 06:27
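To make the differences between the classes mentioned in these comments concrete, here is a small sketch that scans one sample string with each of them. The method name `scan_with_classes` and the sample string are made up for illustration. The last line checks VAŽECKÁ written with precomposed characters; the comment above gives no details on its input, so this does not reproduce that report.

```ruby
# Hypothetical helper: scan the same text with each character class
# discussed in the comments and collect the results by label.
def scan_with_classes(text)
  {
    '\w'          => text.scan(/\w+/),          # ["bl", "b", "r_42"]
    '[[:alpha:]]' => text.scan(/[[:alpha:]]+/), # ["blåbær"]
    '[[:alnum:]]' => text.scan(/[[:alnum:]]+/), # ["blåbær", "42"]
    '[[:word:]]'  => text.scan(/[[:word:]]+/),  # ["blåbær_42"]
    '\p{Alpha}'   => text.scan(/\p{Alpha}+/)    # ["blåbær"]
  }
end

scan_with_classes("blåbær_42").each do |label, tokens|
  puts "#{label}: #{tokens.inspect}"
end

# Precomposed Ž (U+017D) and Á (U+00C1) are matched by [[:alpha:]].
puts !!("VA\u017DECK\u00C1" =~ /\A[[:alpha:]]+\z/)  # prints true
```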

Andreas · Answer 2 · 2011-09-03T10:20:27.823

0

One solution would of course be to simply list all of them and use them as you normally would, although I assume there can be quite a few.

If you are using UTF-8, you will find that such characters are often split into two parts: the "base" character itself, followed by the accent (0x0300 and 0x0301, I believe), also called a combining character. However, this is not always the case, since some characters can also be written with a single "hardcoded" (precomposed) character code, so you need to normalize the UTF-8 string to NFD form first.

Of course, you could also turn any string you have into UTF8 and then back into the original charset... but the overhead might become quite large if you are doing bulk operations.

EDIT: To answer your question specifically, the best solution is likely to normalize your strings to UTF-8 NFD form and then simply add 0x0300 and 0x0301 to your list of acceptable characters, along with whatever other combining characters you want to allow (such as the marks in åäö; you can find them all in "charmap" on Windows, starting at 0x0300 and going up).
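A minimal sketch of this normalize-then-match idea in Ruby follows. It assumes Ruby 2.2 or later for `String#unicode_normalize`, which is not available on the asker's Ruby 1.9.2. The method names are hypothetical, and the allowed range U+0300 to U+036F is the lenient one suggested in the comments below rather than just the two accents.

```ruby
# Hypothetical pattern: ASCII letters, whitespace, and the combining
# marks U+0300..U+036F (the lenient range, an assumption).
def nfd_name_pattern
  /\A[A-Za-z\s\u0300-\u036F]*\z/
end

# Decompose first so that "ù" becomes "u" followed by U+0300,
# then match against the pattern above.
def valid_nfd_name?(name)
  decomposed = name.unicode_normalize(:nfd)
  !!(decomposed =~ nfd_name_pattern)
end

puts valid_nfd_name?("Rùby")    # prints true
puts valid_nfd_name?("Oilalà")  # prints true
puts valid_nfd_name?("blåbær")  # prints false: "æ" has no decomposition
```

The last case shows a limit of this approach: letters such as "æ" are not a base letter plus an accent, so they would still have to be listed separately.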

edited Sep 03 '11 at 10:20

answered Sep 03 '11 at 10:14

Andreas


So, what should I do to accomplish this? And how? – user502052 Sep 03 '11 at 10:21
I edited my answer to include a short explanation for your specific case, have a look at it and see if it helps. Otherwise just comment again. – Andreas Sep 03 '11 at 10:24
If you want to be a bit lenient and aren't too concerned with users abusing strange names, just add 0x0300-0x036F to your list of characters, that will include all combining characters. – Andreas Sep 03 '11 at 10:28
Also, depending on your specific purpose, you could probably just use a Unicode character property in your regex instead. I'm not sure how they look in your language, but in PHP one can write \pL and it will accept any "letter" from any language (again, this can be open to intentional abuse; there are a lot of letters that you likely wouldn't consider to be a letter). http://www.regular-expressions.info/unicode.html – Andreas Sep 03 '11 at 10:30
Seeing as I'm not allowed to comment on the answer below, just be careful with unicode character properties, such as [:alpha:], they match "åäö", but they also happily match "ﻩﻷﻼ﷼ﮬ₳ᵭᵰݡᴃ" and such... that is, anything that could be considered a letter in any language. – Andreas Sep 03 '11 at 11:28
