First: I'm very bad at reading regular expressions and at handling Unicode characters.
In Germany, government IT systems do not have to support all characters, only a subset of the Latin script in Unicode.
The official documentation provides the following regular expression for XML Schema:
(([	-

 -~¡-¬®-ćĊ-ěĞ-ģĦ-ıĴ-śŞ-ūŮ-žƏƠ-ơƯ-ưƷǍ-ǔǞ-ǟǤ-ǰǴ-ǵǺ-ǿȘ-țȞ-ȟȪ-ȫȮ-ȳəʒḂ-ḃḊ-ḋḐ-ḑḞ-ḡḤ-ḧḰ-ḱṀ-ṁṄ-ṅṖ-ṗṠ-ṣṪ-ṫẀ-ẅẌ-ẓẞẠ-ầẪ-ẬẮ-ềỄ-ồỖ-ờỤ-ỹ€])|(M̂|N̂|m̂|n̂|D̂|d̂|J̌|L̂|l̂))*
I'm now trying to migrate this regular expression to Java and am wondering how to do this. As a first step I wrote these two test methods, one with an obviously valid Latin string and one with an obviously invalid one:
@Test
@DisplayName("OK: Just normal characters and numbers")
void testJustNormalCharacters() {
    String sut = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890";
    assertTrue(RegExPruefung.matches(sut, RegEx.T_VALIDSTRINGLATIN));
}

@Test
@DisplayName("NOK: Chinese sign")
void testChineseSign() {
    String sut = "abc⺠";
    assertFalse(RegExPruefung.matches(sut, RegEx.T_VALIDSTRINGLATIN));
}
To clarify: I store the regex in an enumeration. The tests call the following method. As you can see, it only takes the enum value and passes its regex to the standard String.matches method. For other regexes this works fine.
public static boolean matches(String stringToCheck, RegEx regExToMatch) {
    return stringToCheck.matches(regExToMatch.getRegEx());
}
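For readers who want to reproduce the setup, here is a minimal, self-contained sketch of how such an enum and helper could fit together. The names `RegExSketch`, `RegExPruefungSketch`, and the constant `T_DIGITS_ONLY` are hypothetical stand-ins invented for this illustration; only the shape of `getRegEx()` and `matches` follows the code shown above. Note that `String.matches` always tests the whole input, so no `^` or `$` anchors are needed in the stored pattern.

```java
// Hypothetical stand-in for the real enumeration; only getRegEx() mirrors it.
enum RegExSketch {
    // Illustrative constant, not part of the real code base.
    T_DIGITS_ONLY("[0-9]*");

    private final String regEx;

    RegExSketch(String regEx) {
        this.regEx = regEx;
    }

    String getRegEx() {
        return regEx;
    }
}

class RegExPruefungSketch {
    // Same shape as the helper above: String.matches tests the entire input.
    static boolean matches(String stringToCheck, RegExSketch regExToMatch) {
        return stringToCheck.matches(regExToMatch.getRegEx());
    }

    public static void main(String[] args) {
        System.out.println(matches("12345", RegExSketch.T_DIGITS_ONLY)); // true
        System.out.println(matches("123a5", RegExSketch.T_DIGITS_ONLY)); // false
    }
}
```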
What I've tried so far:
1) My first try was to escape each - as \- so that I could use the XML Schema expression in a Java string, but the test with only letters and digits still returns false:
"^(([	\\-

 \\-~¡\\-¬®\\-ćĊ\\-ěĞ\\-ģĦ\\-ıĴ\\-śŞ\\-ūŮ\\-žƏƠ\\-ơƯ\\-ưƷǍ\\-ǔǞ\\-ǟǤ\\-ǰǴ\\-ǵǺ\\-ǿȘ\\-țȞ\\-ȟȪ\\-ȫȮ\\-ȳəʒḂ\\-ḃḊ\\-ḋḐ\\-ḑḞ\\-ḡḤ\\-ḧḰ\\-ḱṀ\\-ṁṄ\\-ṅṖ\\-ṗṠ\\-ṣṪ\\-ṫẀ\\-ẅẌ\\-ẓẞẠ\\-ầẪ\\-ẬẮ\\-ềỄ\\-ồỖ\\-ờỤ\\-ỹ€])|(M̂|N̂|m̂|n̂|D̂|d̂|J̌|L̂|l̂))*$"
2) Second, I tried replacing the regular expression with the predefined \p{isLatin} class, resulting in ^\\p{isLatin}*$, but the test still says the first string is not a valid Latin one.
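Both attempts can be isolated in a small sketch. It uses a shortened character class (`[ -~]`, one range taken from the expression above) instead of the full one, and it writes the script property in the documented `\p{IsLatin}` spelling; the class name `AttemptsDemo` and its methods are made up for this illustration. Inside a character class an escaped `\-` is a literal hyphen rather than a range operator, and `\p{IsLatin}` only covers characters whose Unicode script is Latin, which excludes the ASCII digits (their script is Common).

```java
import java.util.regex.Pattern;

class AttemptsDemo {
    // Unescaped hyphen: a range from space (U+0020) to tilde (U+007E).
    static final Pattern RANGE = Pattern.compile("[ -~]*");

    // Escaped hyphen: only the three literals space, '-' and '~' remain.
    static final Pattern ESCAPED = Pattern.compile("[ \\-~]*");

    // Script property: characters whose Unicode script is Latin.
    static final Pattern LATIN_SCRIPT = Pattern.compile("\\p{IsLatin}*");

    static boolean matchesRange(String s) {
        return RANGE.matcher(s).matches();
    }

    static boolean matchesEscaped(String s) {
        return ESCAPED.matcher(s).matches();
    }

    static boolean matchesLatinScript(String s) {
        return LATIN_SCRIPT.matcher(s).matches();
    }

    public static void main(String[] args) {
        String sample = "abc123";
        System.out.println(matchesRange(sample));             // true
        System.out.println(matchesEscaped(sample));           // false
        System.out.println(matchesLatinScript("abc"));        // true
        System.out.println(matchesLatinScript(sample));       // false
        System.out.println(Character.UnicodeScript.of('1'));  // COMMON
    }
}
```

This suggests that the first test string fails in attempt 1 because the ranges were turned into literal hyphens, and in attempt 2 because the string contains digits.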
How do I solve this problem?
edit:
I don't think it's a duplicate of "SO Java regex for support Unicode", because I think my main problem is understanding how to transfer the expression from XML Schema to Java. Nevertheless, that thread reminded me that the Unicode escape prefix (\u) must be written with a double backslash.
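A minimal sketch of the double-backslash form, assuming the first entries of the character class above are the ranges U+0009 to U+000A, the single character U+000D, and U+0020 to U+007E. The class name `UnicodeEscapeDemo` is hypothetical. With doubled backslashes the escape reaches the regex engine untouched and is resolved there; a single-backslash escape for a line terminator inside a string literal would instead be replaced by the compiler and break the literal.

```java
import java.util.regex.Pattern;

class UnicodeEscapeDemo {
    // The regex engine, not the compiler, resolves these escapes:
    // TAB to LF, then CR, then space to tilde.
    static final Pattern CONTROL_AND_ASCII =
            Pattern.compile("[\\u0009-\\u000A\\u000D\\u0020-\\u007E]*");

    static boolean matches(String s) {
        return CONTROL_AND_ASCII.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("a\tb\r\n ~"));                 // true
        System.out.println(matches(String.valueOf((char) 0x07)));  // false
    }
}
```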