Regex words with accents

Question

I have this regex:

\b(t[úu]s*)\b

And I have these words:

tu (works)
tú (doesn't work)
tus (works)
tús (works)

Why can't I match tú?

I am using PHP and I am testing in http://gskinner.com/RegExr/ — Nicopag, Mar 27 '13 at 12:57
Perhaps this question can help: http://stackoverflow.com/q/2133758/1649067 — , Mar 27 '13 at 13:00
Thanks Gustavo. I don't know why gskinner.com doesn't work with that. Thanks. — Nicopag, Mar 27 '13 at 13:04
Apart from single precomposed characters, there are also characters that have to be built from a base character plus combining diacritics. Perhaps this is the problem? — nhahtdh, Mar 27 '13 at 13:04
I don't think regex is very unicode-aware. More bytes and characters than codepoints =( I might be wrong though. Maybe PHP has a modifier to enable unicode-awareness to make regexing for all `U` variants easier. — Rudie, Mar 27 '13 at 13:17

score 3 · Answer 1 · answered Mar 27 '13 at 13:20

If the regex doesn't match, the two characters differ.

"u with acute" can be expressed as the single character ú (U+00FA) or by combining u (U+0075) with the combining acute accent character (U+0301), which gives an identical-looking ú.

You have to either convert your input string or include both variants in your regular expression; see http://www.regular-expressions.info/unicode.html for details.
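
A minimal sketch of both options in PHP, assuming PHP 7+ (for the `\u{...}` string escapes) and UTF-8 input. The variable names are only for illustration, and the normalization step assumes the intl extension is installed, so it is guarded by a `class_exists` check:

```php
<?php
// Two spellings of "tú" that look identical on screen.
$precomposed = "t\u{00FA}";   // t + U+00FA (single code point)
$decomposed  = "tu\u{0301}";  // t + u + U+0301 (combining acute accent)

// Option 1: accept both forms in the pattern. The /u modifier makes PCRE
// treat pattern and subject as UTF-8, so \x{0301} is one code point.
$pattern = '/t(?:\x{00FA}|u\x{0301}?)s*/u';

var_dump(preg_match($pattern, $precomposed)); // int(1)
var_dump(preg_match($pattern, $decomposed));  // int(1)

// Option 2: normalize the input to NFC first, so only the precomposed
// form needs to appear in the pattern. Requires the intl extension.
if (class_exists('Normalizer')) {
    $nfc = Normalizer::normalize($decomposed, Normalizer::FORM_C);
    var_dump($nfc === $precomposed); // bool(true)
}
```

Normalizing once at the input boundary is usually less error-prone than listing every variant in every pattern, but it only helps if the accented characters were the actual cause of the mismatch.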

score 2 · Answer 2 · answered Mar 27 '13 at 14:00

Why doesn't that expression match tú?

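That expression doesn't match tú because \b doesn't seem to recognize ú as a word character. \b only matches between a word character and a non-word character, so it fails right after ú when what follows is the end of the string or another non-word character. (tús still matches because the final s is a word character.)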

You could use something like this instead:

/(?<!\p{L})(t[úu]s*)(?!\p{L})/u

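\p{L} matches any Unicode letter, so the two lookarounds require that no letter comes directly before or after the match. The /u modifier makes PCRE treat the pattern and the subject as UTF-8.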
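
A quick way to check this in PHP rather than in an online tester. This is only a sketch: it assumes the script file itself is saved as UTF-8, and the last two sample words are made up to show the lookarounds rejecting partial matches:

```php
<?php
// Pattern from this answer: lookarounds replace \b, /u enables UTF-8 mode.
$pattern = '/(?<!\p{L})(t[úu]s*)(?!\p{L})/u';

// The four words from the question, plus two hypothetical words that
// contain "tu"/"tú" only as part of a longer word.
$words = ['tu', 'tú', 'tus', 'tús', 'tuba', 'atú'];

foreach ($words as $word) {
    // preg_match returns 1 on a match, 0 on no match, false on error.
    $matched = preg_match($pattern, $word) === 1;
    echo $word, ': ', $matched ? 'match' : 'no match', "\n";
}
```

The first four words should report a match; `tuba` and `atú` should not, because a letter sits directly after or before the candidate match.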


Qtax


Hi, I tried this (\p{L})(t[úu]s*)(\p{L})/u but it doesn't work :( – Nicopag Mar 28 '13 at 02:15
@user2088434, why don't you copy the whole expression? You are missing several parts from it. – Qtax Mar 28 '13 at 13:09
