
I have this regex

\b(t[úu]s*)\b

And I have these words:

tu (works)
tú (doesn't work)
tus (works)
tús (works)

Why can't I match tú?

GSerg
Nicopag
  • I am using PHP and I am testing in http://gskinner.com/RegExr/ – Nicopag Mar 27 '13 at 12:57
  • But it's matching: http://rubular.com/r/CS7wRf7y4N – Gustavo F Mar 27 '13 at 12:58
  • Perhaps this question can help: http://stackoverflow.com/q/2133758/1649067 –  Mar 27 '13 at 13:00
  • Thanks Gustavo. I don't know why gskinner.com doesn't work with that. Thanks – Nicopag Mar 27 '13 at 13:04
  • Apart from single precomposed characters, there are also characters that have to be built from a base character plus combining diacritics. Perhaps this is the problem? – nhahtdh Mar 27 '13 at 13:04
  • I don't think regex is very unicode-aware. More bytes and characters than codepoints =( I might be wrong though. Maybe PHP has a modifier to enable unicode-awareness to make regexing for all `U` variants easier. – Rudie Mar 27 '13 at 13:17

2 Answers


If the regex doesn't match, the ú in your input is probably not the same character as the ú in your pattern.

"u with acute" can be expressed as the single character ú (U+00FA) or by combining u (U+0075) with the combining acute accent character (U+0301), which gives a similar-looking ú.

You have to either convert your input string or include both variants in your regular expression; see http://www.regular-expressions.info/unicode.html for details.
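To make the two options concrete, here is a minimal sketch in Python (not PHP, so treat it as an illustration of the idea rather than a drop-in fix). The names `PRECOMPOSED_ONLY`, `BOTH_FORMS`, and `matches_after_nfc` are made up for this example, and the word-boundary anchors are left out so that only the character encoding is being tested.

```python
import re
import unicodedata

# The two spellings of "tú" look alike but are different code point sequences.
precomposed = "t\u00fa"   # t + ú (U+00FA, one code point)
decomposed = "tu\u0301"   # t + u + combining acute accent (U+0301)

# Pattern that only knows the precomposed ú (U+00FA).
PRECOMPOSED_ONLY = re.compile("t[\u00fau]s*")

# Option 2: list both variants in the pattern itself.
BOTH_FORMS = re.compile("t(?:\u00fa|u\u0301?)s*")


def matches_after_nfc(text: str) -> bool:
    """Option 1: normalize the input to NFC before matching."""
    normalized = unicodedata.normalize("NFC", text)
    return PRECOMPOSED_ONLY.fullmatch(normalized) is not None


print(precomposed == decomposed)                              # the strings differ
print(PRECOMPOSED_ONLY.fullmatch(decomposed) is not None)     # no full match
print(matches_after_nfc(decomposed))                          # matches once normalized
print(BOTH_FORMS.fullmatch(decomposed) is not None)           # matches as-is
```

Normalizing the input keeps the pattern short; listing both variants avoids touching the input but has to be repeated for every accented letter in the pattern.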

Stefan

Why doesn't that expression match?

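That expression doesn't match because \b doesn't seem to recognize ú as a word character, so it finds no word boundary between ú and a following non-word character or the end of the string. That is also why tús still matches: there the final \b sits after the s.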

You could use something like this instead:

/(?<!\p{L})(t[úu]s*)(?!\p{L})/u

\p{L} matches any Unicode letter, and the /u modifier makes PHP treat both the pattern and the subject as UTF-8.
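The behaviour described above can be reproduced with a small Python sketch, under two stated assumptions: `re.ASCII` stands in for an engine whose \b only knows ASCII word characters, and `[^\W\d_]` stands in for \p{L}, which Python's built-in `re` module does not support (it is an approximation of "letter", not an exact equivalent). The names `ASCII_BOUNDARY`, `LETTER_LOOKAROUND`, and `found` are hypothetical.

```python
import re

# Assumption: re.ASCII mimics an engine where \b treats only [A-Za-z0-9_] as word characters.
ASCII_BOUNDARY = re.compile(r"\b(t[\u00fau]s*)\b", re.ASCII)

# Same idea as the lookaround pattern, with [^\W\d_] approximating \p{L}.
LETTER_LOOKAROUND = re.compile(r"(?<![^\W\d_])(t[\u00fau]s*)(?![^\W\d_])")

# The four test words from the question, using the precomposed ú (U+00FA).
WORDS = ["tu", "t\u00fa", "tus", "t\u00fas"]


def found(pattern: re.Pattern, text: str) -> bool:
    """Return True if the pattern matches anywhere in the text."""
    return pattern.search(text) is not None


ascii_results = [found(ASCII_BOUNDARY, w) for w in WORDS]
lookaround_results = [found(LETTER_LOOKAROUND, w) for w in WORDS]

for word, with_b, with_lookaround in zip(WORDS, ascii_results, lookaround_results):
    print(word, with_b, with_lookaround)
```

With the ASCII-only \b, only the second word fails, which is the pattern of results reported in the question; the lookaround version accepts all four and still rejects the sequence inside a longer word such as atún.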

Qtax