60

Here's a fun snippet I ran into today:

/\ba/.test("a") --> true
/\bà/.test("à") --> false

However,

/à/.test("à") --> true

Firstly, wtf?

Secondly, if I want to match an accented character at the start of a word, how can I do that? (I'd really like to avoid using over-the-top selectors like /(?:^|\s|'|\(\) ....)

nickf
  • 9
    The answer to your WTF is that Javascript doesn’t handle Unicode correctly in regular expressions. See [the standard](http://unicode.org/reports/tr18/#Compatibility_Properties) to see how it is supposed to work. Or use a language that’s standards-compliant in this regard. Just to name a few... in Perl, PHP, PCRE, and ICU regexes, `"à"` certainly matches the pattern `/\bà/`. They’re much better for Unicode work. – tchrist Mar 26 '11 at 10:18
  • you may want to remove accents & then do a simple [a-z] check. see http://stackoverflow.com/questions/990904/javascript-remove-accents-in-strings – Adriano Sep 15 '14 at 13:31

6 Answers

71

This worked for me:

/^[a-z\u00E0-\u00FC]+$/i

With help from here
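
It may help to check what this range actually covers before relying on it. The following is only a sketch: the sample strings are chosen here for illustration and are not from the answer. The range U+00E0-U+00FC sits in the Latin-1 Supplement block, so it takes in the division sign (U+00F7) and leaves out letters such as œ (U+0153).

```javascript
// Pattern from the answer above: ASCII letters plus U+00E0-U+00FC, case-insensitive.
const latin1Word = /^[a-z\u00E0-\u00FC]+$/i;

// Escapes are used so that each accented sample is an unambiguous single code point.
console.log(latin1Word.test("\u00e0"));     // "à"     -> true
console.log(latin1Word.test("\u00c9cole")); // "École" -> true (matched through the i flag)
console.log(latin1Word.test("\u00f7"));     // "÷"     -> true, although it is not a letter
console.log(latin1Word.test("\u0153uf"));   // "œuf"   -> false, U+0153 is outside the range
```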

Dharman
Wak
42

/\bà/.test("à") doesn't match because "à" is not a word character. The escape sequence \b matches only at a boundary between a word character and a non-word character. /\ba/.test("a") matches because "a" is a word character, so there is a boundary between the beginning of the string (which is not a word character) and the letter "a".

Word characters in JavaScript's regex are defined as [a-zA-Z0-9_].

To match an accented character at the start of a string, just use the ^ character at the beginning of the regex (e.g. /^à/). That character means the beginning of the string (unlike \b, which matches at any word boundary within the string). It's the most basic and standard regular expression anchor, so it's definitely not over the top.
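
The boundary logic described above can be checked directly. The sketch below also shows one possible Unicode-aware "start of word" test built from a negative lookbehind and property escapes. That part is an assumption added for illustration, not something from this answer: the name `wordStart` is hypothetical, and it needs an engine with the `u` flag and lookbehind support (ES2018).

```javascript
// \w is ASCII-only in JavaScript, so U+00E0 ("à") is not a word character.
console.log(/\w/.test("a"));      // true
console.log(/\w/.test("\u00e0")); // false

// "à" alone has non-word characters on both sides, so there is no \b before it.
console.log(/\b\u00e0/.test("\u00e0"));  // false
// A \b does exist between "a" (word) and "à" (non-word), which is the wrong place.
console.log(/\b\u00e0/.test("a\u00e0")); // true

// Hypothetical Unicode-aware word start: "à" not preceded by a letter, digit or underscore.
const wordStart = /(?<![\p{L}\p{N}_])\u00e0/u;
console.log(wordStart.test("\u00e0"));         // "à"         -> true
console.log(wordStart.test("voil\u00e0"));     // "voilà"     -> false
console.log(wordStart.test("(\u00e0 propos")); // "(à propos" -> true
```

A lookbehind on "not a letter" avoids having to list spaces, brackets, commas and so on one by one.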

Riimu
  • Ah ok that explains a lot of things, but I guess I actually said the wrong thing in my original question. I need to match at the start of a word, not a string. The reason I think the selector would be "over-the-top" would be because it would need to match the start of a string, spaces, brackets, commas, full stops... – nickf Mar 25 '11 at 19:08
  • 1
    +1 I would only add that with the `re.test()` method, one needs to be aware of the behavior of the `re.lastIndex` property, which contains the offset of the last match (and is where the next match attempt will start). This does not apply in this case since the method is being applied to a regex literal, but it does matter if the regex object is stored in a variable and then used more than once. – ridgerunner Mar 25 '11 at 19:12
  • 1
    Javascript is out of compliance with [The Unicode Standard](http://unicode.org/reports/tr18/#Compatibility_Properties), because the cited standard quite clearly states that things like à are absolutely intended to be matched by `\w` in regular expressions. – tchrist Mar 26 '11 at 10:17
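
The `lastIndex` caveat from ridgerunner's comment can be seen in a few lines. One qualification added here (it is not in the comment): `test()` only reads and updates `lastIndex` when the regex has the `g` or `y` flag. The variable names are made up for this sketch.

```javascript
// A stored regex with the g flag remembers where the previous match ended.
const globalA = /a/g;
const first = globalA.test("a");           // matches at index 0
const indexAfterFirst = globalA.lastIndex; // now 1
const second = globalA.test("a");          // search starts at index 1 and fails; lastIndex resets to 0
const third = globalA.test("a");           // matches again from index 0
console.log(first, indexAfterFirst, second, third); // true 1 false true

// Without g (or y), test() always starts from index 0.
const plainA = /a/;
const plainResults = [plainA.test("a"), plainA.test("a")];
console.log(plainResults); // [ true, true ]
```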
3

If you want to match letters, whether or not they're accented, Unicode property escapes can be helpful.

/\p{Letter}/u.test("à"); // true
/\p{Letter}/u.test('œ'); // true
/\p{Letter}/u.test('a'); // true
/\p{Letter}/u.test('3'); // false

Matching to the start of a word is tricky, but (?<=(?:^|\s)) seems to do the trick. The (?<= ) is a positive lookbehind, ensuring that something exists before the main expression. The (?: ) is a non-capture group, so you don't end up with a reference to this part in whatever match you use later. Then the ^ will match the start of the string if the multiline flag isn't set or the start of the line if the multiline flag is set and the \s will match a whitespace character (space/tab/linebreak).

So using them together, it would look something like:

/(?<=(?:^|\s))\p{Letter}*/u

If you only want to match words that start with an accented character, put a negated character set for a-zA-Z first. Note that [^a-zA-Z] also accepts digits and punctuation, not just accented letters.

/(?<=(?:^|\s))[^a-zA-Z]\p{Letter}*/u.test("bœ") // false
/(?<=(?:^|\s))[^a-zA-Z]\p{Letter}*/u.test("œb") // true

// The string must end with one or more letters, accented or not
let regex = /\p{Letter}+$/u;

console.log(regex.test("œb")); // true
console.log(regex.test("bœb")); // true
console.log(regex.test("àbby")); // true
console.log(regex.test("à3")); // false
console.log(regex.test("16 tons")); // true
console.log(regex.test("3 œ")); // true

console.log('-----');

// Letters only, from the start of the string (or after whitespace) to the end

regex = /(?<=(?:^|\s))\p{Letter}+$/u;

console.log(regex.test("œb")); // true
console.log(regex.test("bœb")); // true
console.log(regex.test("àbby")); // true
console.log(regex.test("à3")); // false

console.log('----');

// Same as above, but the first character must not be an ASCII letter

regex = /(?<=(?:^|\s))[^a-zA-Z]\p{Letter}+$/u;

console.log(regex.test("œb")); // true
console.log(regex.test("bœb")); // false
console.log(regex.test("àbby")); // true
console.log(regex.test("à3")); // false
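
To pull the matching words out of a longer string rather than just test it, the same lookbehind can be combined with the `g` flag. This is a sketch under assumptions: the sample sentence and the names `anyWord` and `nonAsciiStart` are made up here, and `(?![a-zA-Z])` is swapped in for `[^a-zA-Z]` so that the first character still has to be a letter.

```javascript
// Made-up sample text; NFC keeps each accented letter as a single code point.
const text = "Un été à Paris, élégant et âpre".normalize("NFC");

// Every run of letters that starts the string or follows whitespace.
const anyWord = /(?<=(?:^|\s))\p{Letter}+/gu;
const words = text.match(anyWord);
console.log(words); // Un, été, à, Paris, élégant, et, âpre

// Variant: the first character must be a letter outside a-z/A-Z.
const nonAsciiStart = /(?<=(?:^|\s))(?![a-zA-Z])\p{Letter}+/gu;
const accentedWords = text.match(nonAsciiStart);
console.log(accentedWords); // été, à, élégant, âpre
```

Because the lookbehind only accepts whitespace or the start of the string, a word that follows an apostrophe or an opening bracket would not be picked up by either pattern.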
mikemaccana
Amy Shackles
  • *This is by far the best answer* - the current one misses many letters and includes non-letter characters. I've added a link to the MDN page. – mikemaccana Sep 29 '20 at 14:08
2

Stack Overflow also had an issue with non-ASCII characters in regexes; you can find it here. It doesn't deal with word boundaries, but it may still give you useful hints.

There is another page, but its author wants to match strings, not words.

I don't know of an anchor for your problem and couldn't find one just now. But given the monster regexes used in my first link, the group you want to avoid is not over the top and is, in my opinion, your solution.

Community
stema
2

const regex = /^[\-/A-Za-z\u00C0-\u017F ]+$/;
const test1 = regex.test("à");
const test2 = regex.test("Martinez-Cortez");
const test3 = regex.test("Leonardo da vinci");
const test4 = regex.test("ï");

console.log('test1', test1);
console.log('test2', test2);
console.log('test3', test3);
console.log('test4', test4);

Building off of Wak's and Cœur's answers:

/^[\-/A-Za-z\u00C0-\u017F ]+$/

Works for spaces and dashes too (the character class also allows a forward slash).

Example: Leonardo da vinci, Martinez-Cortez
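
For comparison, here is a sketch that puts the range-based pattern next to a property-escape variant. The variant (`unicodeName`) is an assumption added here, not part of this answer, and it requires the `u` flag. The range U+00C0-U+017F includes the multiplication sign (U+00D7) and does not cover combining marks, which is where the two patterns differ.

```javascript
// Range-based pattern from the answer above.
const rangeName = /^[\-/A-Za-z\u00C0-\u017F ]+$/;
// Hypothetical variant: any Unicode letter or combining mark, plus space and hyphen.
const unicodeName = /^[\p{L}\p{M} \-]+$/u;

const samples = [
  "Martinez-Cortez",   // plain ASCII with a hyphen
  "Leonardo da vinci", // spaces
  "a/b",               // the slash is allowed only by the range-based class
  "\u00d7",            // "×", inside U+00C0-U+017F but not a letter
  "Jose\u0301",        // "José" in decomposed form (e + U+0301)
];

for (const sample of samples) {
  console.log(JSON.stringify(sample), rangeName.test(sample), unicodeName.test(sample));
}
```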

Craig1123
1

Unicode allows two alternative but equivalent representations of some accented characters. For example, é has two Unicode representations: '\u00e9' and '\u0065\u0301'. The former is called the composed (precomposed) form and the latter the decomposed form. JavaScript allows conversion between the two:

'\u00e9'.normalize('NFD') // decompose: '\u00e9' -> '\u0065\u0301'
'\u0065\u0301'.normalize('NFC') // compose: '\u0065\u0301' -> '\u00e9'
'\u00e9'.length // composed form: -> 1
'\u0065\u0301'.length // decomposed form: -> 2 (looks identical but has different representation)
'\u00e9' == '\u0065\u0301' // -> false (composed and decomposed strings are not equal)

The code point '\u0301' belongs to the Unicode Combining Diacritical Marks code block 0300-036F. So one way to match these accented characters is to compare them in decomposed form:

// matching accented characters
/[a-zA-Z][\u0300-\u036f]+/.test('é'.normalize('NFD')) // -> true
/\b\u00e9/.test('\u00e9') // -> false (composed form in both pattern and string)
/\be\u0301/.test('\u00e9'.normalize('NFD')) // -> true (NOTE: the pattern uses the decomposed form)

// matching accented words
/^\w+$/.test('résumé') // -> false
/^(?:[a-zA-Z][\u0300-\u036f]*)+$/.test('résumé'.normalize('NFD')) // -> true
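
Building on the decomposition idea, one option is to strip the combining marks first and then use the ordinary ASCII-only `\b` and `\w`. This is a minimal sketch, assuming that losing the accents is acceptable for the check; `stripDiacritics` is a hypothetical helper name, not a built-in.

```javascript
// Hypothetical helper: decompose, then drop combining marks in U+0300-U+036F.
function stripDiacritics(input) {
  return input.normalize("NFD").replace(/[\u0300-\u036f]/g, "");
}

console.log(stripDiacritics("r\u00e9sum\u00e9"));               // "resume"
console.log(/\ba/.test(stripDiacritics("\u00e0")));             // true, "à" became "a"
console.log(/^\w+$/.test(stripDiacritics("r\u00e9sum\u00e9"))); // true

// Letters without a canonical decomposition are left unchanged.
console.log(stripDiacritics("\u0153") === "\u0153"); // true, "œ" stays "œ"
```

Match offsets found in the stripped string line up with the original only if the original was in composed form, since every removed mark shifts the decomposed positions.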
virtuoso
  • interesting! But it seems you got the wrong composed charcode. It seems to be '\u00e9' instead of '\u0039' for 'é' , at least in my Firefox Browser and also according to [ISO-Latin](https://de.wikipedia.org/wiki/ISO_8859-1#ISO.2FIEC_8859-1) – Sebastian Feb 12 '23 at 08:48