How to handle combining characters along with the \p{L} pattern for Thai strings?

Question

I need to validate Unicode text, restricting it to letters only (e.g. no symbols, emojis, etc., just characters that can appear in a person's name in any language). The \p{L} category seems to do the trick, but it does not recognize Thai strings. I do not speak Thai, so I got a few common Thai names from ChatGPT, and they all fail in my test. I tried it at RegExr (see the Tests tab) and also wrote a simple test program:

using System.Text.RegularExpressions;

Console.OutputEncoding = System.Text.Encoding.UTF8;

string pattern = @"^[\p{L}\s]+$";

string englishText = "Mary";
Console.Write($"{englishText}: ");
Console.WriteLine(Regex.IsMatch(englishText, pattern, RegexOptions.IgnoreCase).ToString()); // true

string germanText = "RöschenÜmit";
Console.Write($"{germanText}: ");
Console.WriteLine(Regex.IsMatch(germanText, pattern, RegexOptions.IgnoreCase).ToString()); // true

string thaiText = "อรุณรัตน์";
Console.Write($"{thaiText}: ");
Console.WriteLine(Regex.IsMatch(thaiText, pattern, RegexOptions.IgnoreCase).ToString()); // false

string japaneseText = "タクミたくみく";
Console.Write($"{japaneseText}: ");
Console.WriteLine(Regex.IsMatch(japaneseText, pattern, RegexOptions.IgnoreCase).ToString()); // true

I noticed when I try testing each individual character in the Thai string, it seems to recognize them as valid Unicode letters, but as a string, it fails. Just to make sure I do not have any hidden characters, I checked the raw values and I did not see anything suspicious. Any ideas what's going on here?

P.S. I know that some of the characters in the test are from different sets and names may include spaces, dashes, etc., but this is not the point. I'm just trying to solve the Thai strings issue here.

COMMENT: The Thai string contains combining characters, which I guess is what breaks letter detection, even though a letter plus its mark looks like a single letter (i.e. {0x0e23, 0x0e38} renders as "รุ").

Just an FYI: your Chinese text is actually Japanese. Google can probably give you more reliable results than an AI. — Matti Virkkunen, Mar 20 '23 at 16:55
I don't know the Thai language, but I suspect that one or several letters in the string use combining characters. Try adding `\p{Mn}` to your class. — Casimir et Hippolyte, Mar 20 '23 at 17:06
Alek Davis - I've updated the title to mention combining characters - please review the edit and see if it aligns with your intentions; feel free to roll back. — Alexei Levenkov, Mar 20 '23 at 17:20
@Matti Virkkunen: Oh, yes, I got the wrong copy (I tried it with Chinese, too, and forgot to rename the variable). Fixed in the sample. — Alek Davis, Mar 20 '23 at 17:52
@Alexei Levenkov: I marked your addition as a comment, since it's part of the answer. — Alek Davis, Mar 20 '23 at 18:03

Dmitry Bychenko · Accepted Answer · 2023-03-20T17:27:16.143

If we dump the individual characters of thaiText:

string thaiText = "อรุณรัตน์";

var report = string.Join(Environment.NewLine, thaiText
  .Select(c => $"{c} : \\u{(int)c:x4} : {char.GetUnicodeCategory(c)}"));

Console.WriteLine(report);

we'll see the cause of the misbehaviour: characters of the NonSpacingMark category sit among the OtherLetter characters, and \p{L} does not match them:

อ : \u0e2d : OtherLetter
ร : \u0e23 : OtherLetter
ุ : \u0e38 : NonSpacingMark <- doesn't match
ณ : \u0e13 : OtherLetter
ร : \u0e23 : OtherLetter
ั : \u0e31 : NonSpacingMark <- doesn't match
ต : \u0e15 : OtherLetter
น : \u0e19 : OtherLetter
์ : \u0e4c : NonSpacingMark <- doesn't match

Technically, to get rid of these marks we can use normalization:

// The idea is to compose each letter and its marks into a single letter, which should match \p{L}
thaiText = thaiText.Normalize(NormalizationForm.FormC);

but it doesn't work on my workstation, and the reason is an issue.

So if normalization doesn't work in your case either (or you want to be on the safe side), you can try to match Thai characters explicitly: either Thai only

string pattern = @"^[\p{IsThai}\s]+$";

or Thai mixed with all other letters (any letter, plus the Thai block as a special case):

string pattern = @"^[\p{L}\p{IsThai}\s]+$";

or allow both letters (\p{L}) and these non-spacing marks (\p{Mn}):

string pattern = @"^[\p{L}\p{Mn}\s]+$";
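To compare these alternatives side by side, here is a minimal sketch that runs the original pattern and the three patterns above against the Thai sample and a Latin sample. The labels and the `patterns` array are illustrative names, not part of the answer.

```csharp
using System;
using System.Text.RegularExpressions;

string thaiText = "อรุณรัตน์";
string latinText = "Mary";

// The original pattern followed by the three alternatives, each with a short label.
var patterns = new (string Label, string Pattern)[]
{
    ("letters only",    @"^[\p{L}\s]+$"),
    ("Thai block only", @"^[\p{IsThai}\s]+$"),
    ("letters or Thai", @"^[\p{L}\p{IsThai}\s]+$"),
    ("letters plus Mn", @"^[\p{L}\p{Mn}\s]+$"),
};

foreach (var (label, pattern) in patterns)
{
    bool thaiOk = Regex.IsMatch(thaiText, pattern);
    bool latinOk = Regex.IsMatch(latinText, pattern);
    Console.WriteLine($"{label,-16} Thai: {thaiOk,-5} Latin: {latinOk}");
}
```

Only the first pattern rejects the Thai name, and only the Thai-only pattern rejects "Mary". Which of the remaining two to prefer depends on whether combining marks from other scripts should be accepted as well.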

Thanks, Dmitry. Since this problem may apply to other languages, I think the safest approach will be to add \p{Mn} to the pattern. Here is my complete regex for name validation: https://regexr.com/6ru33 — Alek Davis, Mar 20 '23 at 18:39
@AlekDavis: you can do the same like this: https://regex101.com/r/kXD6Tk/1 . Note that `.{64}` means 64 characters, not 64 glyphs. — Casimir et Hippolyte, Mar 20 '23 at 20:13
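
The characters-versus-glyphs distinction in the last comment can be sketched as follows, assuming .NET, where `string.Length` (like a regex `.`) counts UTF-16 code units, while `StringInfo` counts text elements, which are closer to what a reader sees as one glyph. The variable names are illustrative.

```csharp
using System;
using System.Globalization;

string thaiText = "อรุณรัตน์";

// UTF-16 code units: every letter and every combining mark counts separately.
int codeUnits = thaiText.Length;

// Text elements: a base letter together with its combining marks counts once.
int textElements = new StringInfo(thaiText).LengthInTextElements;

Console.WriteLine($"code units: {codeUnits}, text elements: {textElements}");
// code units: 9, text elements: 6
```

So a length limit written as a regex quantifier restricts code units; a limit on visible glyphs would have to be checked separately, for example with `StringInfo`.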

score 2 · Answer 2 · answered Mar 20 '23 at 17:15

This happens because the string contains "mark" characters, which have to be matched separately from letters. Other languages use such characters too, e.g. Tamil. This regex will match the Thai string:

^[\p{L}\p{M}\s]+$

Info about \p{M} from regular-expressions.info:

\p{M} or \p{Mark}: a character intended to be combined with another character (e.g. accents, umlauts, enclosing boxes, etc.).

For comparison, here is the string with mark characters: อรุณรัตน์, and the same string without them: อรณรตน. The latter is matched by \p{L} alone.
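
That comparison can be reproduced in C# with a short sketch. Stripping the marks here is purely diagnostic (it changes the word), and the variable names are illustrative.

```csharp
using System;
using System.Text.RegularExpressions;

string withMarks = "อรุณรัตน์";

// Remove every character in the Mark category (\p{M}) to see what \p{L} is left with.
string withoutMarks = Regex.Replace(withMarks, @"\p{M}", "");

string lettersOnly = @"^\p{L}+$";
string lettersAndMarks = @"^[\p{L}\p{M}]+$";

Console.WriteLine(Regex.IsMatch(withMarks, lettersOnly));      // False
Console.WriteLine(Regex.IsMatch(withoutMarks, lettersOnly));   // True
Console.WriteLine(Regex.IsMatch(withMarks, lettersAndMarks));  // True
```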
