RegEx for a subset of Latin script in unicode from xml-schema to java

Question

First: I'm very bad at reading regular expressions and at handling Unicode characters.

In Germany, government IT systems do not have to support all characters, only a subset of the Latin script in Unicode.

The official documentation provides the following regular expression for XML Schema:

(([&#x9;-&#xa;&#xd;&#x20;-&#x7e;&#xa1;-&#xac;&#xae;-&#x107;&#x10a;-&#x11b;&#x11e;-&#x123;&#x126;-&#x131;&#x134;-&#x15b;&#x15e;-&#x16b;&#x16e;-&#x17e;&#x18f;&#x1a0;-&#x1a1;&#x1af;-&#x1b0;&#x1b7;&#x1cd;-&#x1d4;&#x1de;-&#x1df;&#x1e4;-&#x1f0;&#x1f4;-&#x1f5;&#x1fa;-&#x1ff;&#x218;-&#x21b;&#x21e;-&#x21f;&#x22a;-&#x22b;&#x22e;-&#x233;&#x259;&#x292;&#x1e02;-&#x1e03;&#x1e0a;-&#x1e0b;&#x1e10;-&#x1e11;&#x1e1e;-&#x1e21;&#x1e24;-&#x1e27;&#x1e30;-&#x1e31;&#x1e40;-&#x1e41;&#x1e44;-&#x1e45;&#x1e56;-&#x1e57;&#x1e60;-&#x1e63;&#x1e6a;-&#x1e6b;&#x1e80;-&#x1e85;&#x1e8c;-&#x1e93;&#x1e9e;&#x1ea0;-&#x1ea7;&#x1eaa;-&#x1eac;&#x1eae;-&#x1ec1;&#x1ec4;-&#x1ed3;&#x1ed6;-&#x1edd;&#x1ee4;-&#x1ef9;&#x20ac;])|(&#x4d;&#x302;|&#x4e;&#x302;|&#x6d;&#x302;|&#x6e;&#x302;|&#x44;&#x302;|&#x64;&#x302;|&#x4a;&#x30c;|&#x4c;&#x302;|&#x6c;&#x302;))*

I'm now trying to migrate this regular expression to Java and am wondering how to do this. As a first step I wrote these two test methods: one with an obviously valid Latin string and one with an obviously invalid one:

@Test
@DisplayName("OK: Just normal characters and numbers")
void testJustNormalCharacters() {
  String sut = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890";

  assertTrue(RegExPruefung.matches(sut, RegEx.T_VALIDSTRINGLATIN));
}

@Test
@DisplayName("NOK: Chinese sign")
void testChineseSign() {
  String sut = "abc⺠";

  assertFalse(RegExPruefung.matches(sut, RegEx.T_VALIDSTRINGLATIN));
}

To clarify: I store the regex in an enumeration. The tests call the following method. As you can see, it only takes the enum value and passes it to the standard String.matches method. For other regexes this works fine.

public static boolean matches(String stringToCheck, RegEx regExToMatch) {
  return stringToCheck.matches(regExToMatch.getRegEx());
}

What I've tried so far:

1) My first attempt was to escape the - as \- so that I could use the XML Schema expression in the string, but this still returns false for the test with only letters and digits.

"^(([&#x9;\\-&#xa;&#xd;&#x20;\\-&#x7e;&#xa1;\\-&#xac;&#xae;\\-&#x107;&#x10a;\\-&#x11b;&#x11e;\\-&#x123;&#x126;\\-&#x131;&#x134;\\-&#x15b;&#x15e;\\-&#x16b;&#x16e;\\-&#x17e;&#x18f;&#x1a0;\\-&#x1a1;&#x1af;\\-&#x1b0;&#x1b7;&#x1cd;\\-&#x1d4;&#x1de;\\-&#x1df;&#x1e4;\\-&#x1f0;&#x1f4;\\-&#x1f5;&#x1fa;\\-&#x1ff;&#x218;\\-&#x21b;&#x21e;\\-&#x21f;&#x22a;\\-&#x22b;&#x22e;\\-&#x233;&#x259;&#x292;&#x1e02;\\-&#x1e03;&#x1e0a;\\-&#x1e0b;&#x1e10;\\-&#x1e11;&#x1e1e;\\-&#x1e21;&#x1e24;\\-&#x1e27;&#x1e30;\\-&#x1e31;&#x1e40;\\-&#x1e41;&#x1e44;\\-&#x1e45;&#x1e56;\\-&#x1e57;&#x1e60;\\-&#x1e63;&#x1e6a;\\-&#x1e6b;&#x1e80;\\-&#x1e85;&#x1e8c;\\-&#x1e93;&#x1e9e;&#x1ea0;\\-&#x1ea7;&#x1eaa;\\-&#x1eac;&#x1eae;\\-&#x1ec1;&#x1ec4;\\-&#x1ed3;&#x1ed6;\\-&#x1edd;&#x1ee4;\\-&#x1ef9;&#x20ac;])|(&#x4d;&#x302;|&#x4e;&#x302;|&#x6d;&#x302;|&#x6e;&#x302;|&#x44;&#x302;|&#x64;&#x302;|&#x4a;&#x30c;|&#x4c;&#x302;|&#x6c;&#x302;))*$"

2) Next I tried replacing the regular expression with the predefined \p{isLatin}, resulting in ^\\p{isLatin}*$, but the test still reports that the first string is not a valid Latin one.

How do I solve this problem?

Edit: I don't think this is a duplicate of "SO Java regex for support Unicode", because my main problem is understanding how to transfer the expression from XML Schema to Java. Nevertheless, that thread reminded me that the Unicode escape prefix (\u) must be written with a double backslash.

Regex does not know what `&#x9;` means, that's an XML character entity. As Wiktor says, convert it to something regex will recognize. More info here: https://www.regular-expressions.info/refunicode.html — miken32, Apr 30 '19 at 16:13
Possible duplicate of [Java regex for support Unicode?](https://stackoverflow.com/questions/10894122/java-regex-for-support-unicode) — miken32, Apr 30 '19 at 16:14
@WiktorStribiżew I think that was the main thing I didn't know. Thank you for your comment! — bish, May 01 '19 at 17:18
@miken32 As written in my question I wasn't able to transfer the regex, because I didn't know how to rewrite it. Thanks to the other two guys I know now. Also thank you for the link. — bish, May 01 '19 at 17:19

Pshemo · Accepted Answer · 2019-04-30T16:57:37.627

Instead of &#xHEX; you need \uHEX. Note that while &#xHEX; marks the end of the sequence with ;, \uHEX has no terminator and instead always takes exactly 4 hex digits, padded with leading zeroes where necessary.

So &#x9; isn't represented as \u9 but as \u0009.

Anyway, you can write a small regex-based tool to replace them dynamically.

String originalRegex = "(([&#x9;-&#xa;&#xd;&#x20;-&#x7e;&#xa1;-&#xac;&#xae;-&#x107;&#x10a;-&#x11b;&#x11e;-&#x123;&#x126;-&#x131;&#x134;-&#x15b;&#x15e;-&#x16b;&#x16e;-&#x17e;&#x18f;&#x1a0;-&#x1a1;&#x1af;-&#x1b0;&#x1b7;&#x1cd;-&#x1d4;&#x1de;-&#x1df;&#x1e4;-&#x1f0;&#x1f4;-&#x1f5;&#x1fa;-&#x1ff;&#x218;-&#x21b;&#x21e;-&#x21f;&#x22a;-&#x22b;&#x22e;-&#x233;&#x259;&#x292;&#x1e02;-&#x1e03;&#x1e0a;-&#x1e0b;&#x1e10;-&#x1e11;&#x1e1e;-&#x1e21;&#x1e24;-&#x1e27;&#x1e30;-&#x1e31;&#x1e40;-&#x1e41;&#x1e44;-&#x1e45;&#x1e56;-&#x1e57;&#x1e60;-&#x1e63;&#x1e6a;-&#x1e6b;&#x1e80;-&#x1e85;&#x1e8c;-&#x1e93;&#x1e9e;&#x1ea0;-&#x1ea7;&#x1eaa;-&#x1eac;&#x1eae;-&#x1ec1;&#x1ec4;-&#x1ed3;&#x1ed6;-&#x1edd;&#x1ee4;-&#x1ef9;&#x20ac;])|(&#x4d;&#x302;|&#x4e;&#x302;|&#x6d;&#x302;|&#x6e;&#x302;|&#x44;&#x302;|&#x64;&#x302;|&#x4a;&#x30c;|&#x4c;&#x302;|&#x6c;&#x302;))*";

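// Requires java.util.regex.Pattern and java.util.regex.Matcher.
// Matches one XML hex character reference, e.g. &#x9; or &#x1e02;
Pattern p = Pattern.compile("&#x(?<hex>[0-9a-f]{1,4});", Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(originalRegex);

StringBuffer sb = new StringBuffer();
while (m.find()) {
    // Parse the hex digits and zero-pad them to the 4 digits the regex escape needs
    int decValue = Integer.parseInt(m.group("hex"), 16);
    String replacement = String.format("\\u%04x", decValue);
    m.appendReplacement(sb, Matcher.quoteReplacement(replacement)); // quoteReplacement to escape "\"
}
m.appendTail(sb); // copy whatever follows the last match
String replacedRegex = sb.toString();
//System.out.println(replacedRegex);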

which gives us (([\u0009-\u000a\u000d\u0020-\u007e\u00a1-\u00ac\u00ae-\u0107\u010a-\u011b\u011e-\u0123\u0126-\u0131\u0134-\u015b\u015e-\u016b\u016e-\u017e\u018f\u01a0-\u01a1\u01af-\u01b0\u01b7\u01cd-\u01d4\u01de-\u01df\u01e4-\u01f0\u01f4-\u01f5\u01fa-\u01ff\u0218-\u021b\u021e-\u021f\u022a-\u022b\u022e-\u0233\u0259\u0292\u1e02-\u1e03\u1e0a-\u1e0b\u1e10-\u1e11\u1e1e-\u1e21\u1e24-\u1e27\u1e30-\u1e31\u1e40-\u1e41\u1e44-\u1e45\u1e56-\u1e57\u1e60-\u1e63\u1e6a-\u1e6b\u1e80-\u1e85\u1e8c-\u1e93\u1e9e\u1ea0-\u1ea7\u1eaa-\u1eac\u1eae-\u1ec1\u1ec4-\u1ed3\u1ed6-\u1edd\u1ee4-\u1ef9\u20ac])|(\u004d\u0302|\u004e\u0302|\u006d\u0302|\u006e\u0302|\u0044\u0302|\u0064\u0302|\u004a\u030c|\u004c\u0302|\u006c\u0302))*
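The replacement loop above can also be wrapped into a reusable method. The following is a minimal, self-contained sketch, assuming a hypothetical class name XmlEntityToJavaRegex and method name convert (neither appears in the original answer), applied to a much shorter pattern so the result is easy to inspect:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class XmlEntityToJavaRegex {

    // Matches XML hex character references such as &#x9; or &#x1e02; (1 to 4 hex digits).
    private static final Pattern HEX_ENTITY =
            Pattern.compile("&#x(?<hex>[0-9a-f]{1,4});", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: rewrites every hex reference as a zero-padded,
    // four-digit regex escape (backslash, 'u', then the digits).
    static String convert(String xmlSchemaRegex) {
        Matcher m = HEX_ENTITY.matcher(xmlSchemaRegex);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            int codePoint = Integer.parseInt(m.group("hex"), 16);
            String escape = String.format("\\u%04x", codePoint);
            // quoteReplacement keeps the backslash from being treated as an escape
            m.appendReplacement(sb, Matcher.quoteReplacement(escape));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        // Illustrative input only: tab, line feed and the uppercase letters A-Z
        String javaRegex = convert("[&#x9;-&#xa;&#x41;-&#x5a;]*");
        System.out.println(javaRegex);
        System.out.println("ABC\tXYZ".matches(javaRegex));
        System.out.println("abc".matches(javaRegex));
    }
}
```

Run as is, this should print [\u0009-\u000a\u0041-\u005a]* followed by true and false. Because the conversion happens at runtime, the resulting string never passes through the Java compiler, so no additional escaping is needed.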

NOTE: you can't copy-paste that into a string literal (like "(([\u0009-\u000a...)") because of characters like \u000a and \u000d. Before compilation Java converts every \uXXXX in the source code into the character it represents, so code like

String str = "foo\u000abar";

is seen as if it was written like

String str = "foo
bar";

which is not valid Java (string literals can't contain line terminators directly; they are written as \n and/or \r instead). A tab such as \u0009 would still compile, but \u000a and \u000d in the regex above would not.

But you can pass \u0009 to the regex engine if you escape the \ as \\u0009, for instance

String replacedRegex = "(([\\u0009-\\u000a\\u000d\\u0020-\\u007e\\u00a1-\\u00ac\\u00ae-\\u0107\\u010a-\\u011b\\u011e-\\u0123\\u0126-\\u0131\\u0134-\\u015b\\u015e-\\u016b\\u016e-\\u017e\\u018f\\u01a0-\\u01a1\\u01af-\\u01b0\\u01b7\\u01cd-\\u01d4\\u01de-\\u01df\\u01e4-\\u01f0\\u01f4-\\u01f5\\u01fa-\\u01ff\\u0218-\\u021b\\u021e-\\u021f\\u022a-\\u022b\\u022e-\\u0233\\u0259\\u0292\\u1e02-\\u1e03\\u1e0a-\\u1e0b\\u1e10-\\u1e11\\u1e1e-\\u1e21\\u1e24-\\u1e27\\u1e30-\\u1e31\\u1e40-\\u1e41\\u1e44-\\u1e45\\u1e56-\\u1e57\\u1e60-\\u1e63\\u1e6a-\\u1e6b\\u1e80-\\u1e85\\u1e8c-\\u1e93\\u1e9e\\u1ea0-\\u1ea7\\u1eaa-\\u1eac\\u1eae-\\u1ec1\\u1ec4-\\u1ed3\\u1ed6-\\u1edd\\u1ee4-\\u1ef9\\u20ac])|(\\u004d\\u0302|\\u004e\\u0302|\\u006d\\u0302|\\u006e\\u0302|\\u0044\\u0302|\\u0064\\u0302|\\u004a\\u030c|\\u004c\\u0302|\\u006c\\u0302))*";
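The difference between the two escaping levels can be checked directly. This is a small sketch with hypothetical names (UnicodeEscapeLevels, COMPILER_LEVEL, REGEX_LEVEL) chosen only for illustration:

```java
class UnicodeEscapeLevels {

    // Single backslash: the compiler replaces the escape with a real tab,
    // so the string holds exactly one character.
    static final String COMPILER_LEVEL = "\u0009";

    // Double backslash: the string holds six characters (backslash, 'u', '0', '0', '0', '9')
    // and it is the regex engine that later decodes them.
    static final String REGEX_LEVEL = "\\u0009";

    public static void main(String[] args) {
        System.out.println(COMPILER_LEVEL.length());
        System.out.println(REGEX_LEVEL.length());
        // The regex engine understands the six-character form as "tab"
        System.out.println("\t".matches(REGEX_LEVEL));
        // The same form works for line feed, which cannot be written with a single backslash
        System.out.println("\n".matches("\\u000a"));
    }
}
```

This should print 1, 6, true and true. The double-backslash form is the only one of the two that is safe for every code point in the pattern, including the line terminators.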

Now let's test whether that regex works as intended:

Pattern RegExPruefung = Pattern.compile(replacedRegex);

String sut = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890";
System.out.println(RegExPruefung.matcher(sut).matches());
sut = "abc⺠";
System.out.println(RegExPruefung.matcher(sut).matches());

Output:

true
false
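The two test strings only exercise the character class. The second alternative of the pattern, which allows specific letters followed by a combining mark, could be checked along the same lines. The sketch below uses a deliberately shortened excerpt of the pattern (printable ASCII plus two of the listed pairs) rather than the full expression; the class and method names are hypothetical:

```java
import java.util.regex.Pattern;

class CombiningSequenceCheck {

    // Shortened excerpt (an assumption for brevity, not the full official pattern):
    // printable ASCII, or M + combining circumflex, or J + combining caron.
    static final Pattern EXCERPT = Pattern.compile(
            "(([\\u0020-\\u007e])|(\\u004d\\u0302|\\u004a\\u030c))*");

    static boolean isValid(String s) {
        return EXCERPT.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("M\u0302")); // M + combining circumflex, listed
        System.out.println(isValid("J\u030c")); // J + combining caron, listed
        System.out.println(isValid("x\u0302")); // x + combining circumflex, not listed
        System.out.println(isValid("\u0302"));  // a lone combining mark
    }
}
```

This should print true, true, false and false. The combining marks are not part of the character class, so they are accepted only directly after one of the explicitly listed base letters. The regex engine reaches that alternative by backtracking after first matching the base letter on its own.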

Thank you for the detailed explanation of transferring the regex, including the pitfalls! May I ask you to say a word about `Pattern.UNICODE_CHARACTER_CLASS`? It's mentioned in the linked question but you didn't use it, and in my own test it seems that it's not needed at all? — bish, May 01 '19 at 17:13
@bish From https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html#ubpc it looks like this flag is used when you want to enable Unicode support for other languages when using "Predefined Character classes and POSIX character classes" like `\w` `\d` or `\p{Digit}`. But since you don't use any of them and instead regex explicitly states character ranges via `\uXXXX-\uYYYY` that flag would be redundant. — Pshemo, May 01 '19 at 18:28
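The effect described in the comment above can be seen in a short sketch. The class and method names are hypothetical, and the explicit range used here is only an illustration, not part of the official pattern:

```java
import java.util.regex.Pattern;

class UnicodeCharacterClassDemo {

    // Uses the predefined class, whose meaning depends on the flag
    static boolean matchesWord(String s, int flags) {
        return Pattern.compile("\\w+", flags).matcher(s).matches();
    }

    // Uses an explicit code point range, which does not depend on the flag
    static boolean matchesExplicitRange(String s, int flags) {
        return Pattern.compile("[\\u0041-\\u00ff]+", flags).matcher(s).matches();
    }

    public static void main(String[] args) {
        String word = "Stra\u00dfe"; // contains U+00DF, the German sharp s

        System.out.println(matchesWord(word, 0));
        System.out.println(matchesWord(word, Pattern.UNICODE_CHARACTER_CLASS));
        System.out.println(matchesExplicitRange(word, 0));
        System.out.println(matchesExplicitRange(word, Pattern.UNICODE_CHARACTER_CLASS));
    }
}
```

This should print false, true, true and true: only the predefined class changes its behaviour with the flag, while the explicit range gives the same result either way.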
