2

I have a regular expression that validates input consisting of letters, digits, underscore and hyphen. I now need to support multibyte (non-ASCII) characters as well, so I added the Unicode character class flag, but the pattern still does not match. Can someone enlighten me on this?

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Test123 {

    public static void main(String[] args) {

        String test="熏肉еконcarácterbañlácaractères" ;
        Pattern pattern = Pattern.compile("^[a-zA-Z0-9_-]*$",Pattern.UNICODE_CHARACTER_CLASS);

        Matcher matcher = pattern.matcher(test);
        if(matcher.matches())
        {
            System.out.println("matched");
        }
        else{
            System.out.println("not matched");
        }
    }

}
Pshemo
shreekanth

3 Answers

4

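You can use the POSIX class \\p{Alpha} instead of the literal class [a-zA-Z]. Combined with the Pattern.UNICODE_CHARACTER_CLASS flag, it matches Unicode alphabetic characters, including accented ones; without the flag it only matches ASCII letters.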

Example

String test = "熏肉еконcarácterbañlácaractères";
Pattern pattern = Pattern.compile("\\p{Alpha}+", Pattern.UNICODE_CHARACTER_CLASS);
Matcher m = pattern.matcher(test);
while (m.find()) {
    System.out.println(m.group());
}

Output

熏肉еконcarácterbañlácaractères
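
To make the role of the flag concrete, here is a minimal sketch that compiles the same \\p{Alpha}+ pattern with and without Pattern.UNICODE_CHARACTER_CLASS. The class name AlphaFlagDemo and its two helper methods are made up for this illustration and are not part of the original answer.

```java
import java.util.regex.Pattern;

class AlphaFlagDemo {

    // Without the flag, \p{Alpha} is the POSIX class and covers ASCII letters only.
    static boolean matchesAscii(String s) {
        return Pattern.compile("\\p{Alpha}+").matcher(s).matches();
    }

    // With the flag, \p{Alpha} is widened to Unicode alphabetic characters.
    static boolean matchesUnicode(String s) {
        return Pattern.compile("\\p{Alpha}+", Pattern.UNICODE_CHARACTER_CLASS)
                      .matcher(s)
                      .matches();
    }

    public static void main(String[] args) {
        String test = "熏肉еконcarácterbañlácaractères";
        System.out.println("without flag: " + matchesAscii(test));
        System.out.println("with flag:    " + matchesUnicode(test));
    }
}
```

Assuming the accented letters in the test string are stored as single precomposed code points, the first call should report false and the second true. Note that this pattern covers letters only; the digits, underscore and hyphen from the question would still need to be added to the class.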
Mena
1

The problem is that, even with that flag, a-z does not mean "all Unicode alphabetic characters"; it still means only "characters between a and z".

The UNICODE_CHARACTER_CLASS flag adds Unicode support only to the predefined character classes, such as \w (which normally represents [a-zA-Z0-9_]), and to the POSIX classes.

So try:

Pattern.compile("^[\\w-]*$",Pattern.UNICODE_CHARACTER_CLASS);
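
For completeness, a small self-contained sketch of the question's validator rewritten around this pattern. The class name UnicodeWordValidator and the method isValid are hypothetical names chosen for the example.

```java
import java.util.regex.Pattern;

class UnicodeWordValidator {

    // Compiled once and reused. The flag widens \w from [a-zA-Z0-9_]
    // to its Unicode-aware definition; the hyphen is added literally.
    private static final Pattern ALLOWED =
            Pattern.compile("^[\\w-]*$", Pattern.UNICODE_CHARACTER_CLASS);

    static boolean isValid(String input) {
        return ALLOWED.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("熏肉еконcarácterbañlácaractères"));
        System.out.println(isValid("abc_123-xyz"));
        System.out.println(isValid("hello world")); // contains a space
    }
}
```

Two things to keep in mind with this approach: the * quantifier means the empty string is accepted as well, and the Unicode-aware \w also accepts digits from other scripts, which may or may not be what the validation is meant to allow.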
Pshemo
0
[\\p{L}\\p{M}]+

You can use this to match Unicode letters.

\p{L} matches any kind of letter from any language.
\p{M} matches a character intended to be combined with another character (e.g. accents, umlauts, enclosing boxes, etc.).

See demo.

https://regex101.com/r/fM9lY3/30
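
A short sketch of why \p{M} can matter: when an accented letter is stored in decomposed form (base letter plus combining mark), \p{L} alone no longer covers it. The class LetterMarkDemo, its method names, and the extended pattern with \p{Nd}, underscore and hyphen are assumptions added for illustration; they are one possible way to cover the question's original character set, not something stated in the answer.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

class LetterMarkDemo {

    private static final Pattern LETTERS_ONLY = Pattern.compile("\\p{L}+");
    private static final Pattern LETTERS_AND_MARKS = Pattern.compile("[\\p{L}\\p{M}]+");
    // Assumed extension: also allow decimal digits, underscore and hyphen.
    private static final Pattern IDENTIFIER = Pattern.compile("^[\\p{L}\\p{M}\\p{Nd}_-]*$");

    static boolean lettersOnly(String s) {
        return LETTERS_ONLY.matcher(s).matches();
    }

    static boolean lettersAndMarks(String s) {
        return LETTERS_AND_MARKS.matcher(s).matches();
    }

    static boolean validId(String s) {
        return IDENTIFIER.matcher(s).matches();
    }

    public static void main(String[] args) {
        // NFD splits accented letters into a base letter and a combining mark.
        String decomposed = Normalizer.normalize("caractères", Normalizer.Form.NFD);
        System.out.println(lettersOnly(decomposed));
        System.out.println(lettersAndMarks(decomposed));
        System.out.println(validId("bañ_123-" + decomposed));
    }
}
```

Unlike \w and \p{Alpha}, the \p{L} and \p{M} categories do not depend on Pattern.UNICODE_CHARACTER_CLASS, so no flag is passed here.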

vks