Unicode character support as range in regex

Question

I am writing a regex that should accept letters in both lower and upper case, digits, the hyphen (-), and Unicode characters in the range 00C0-00FF.

I have seen answers that explain how to support all language characters with the regex \p{L}+, but I don't want to support all of them. I only want to support the specific range [00C0-00FF] of Unicode characters listed at https://unicode-table.com/en/blocks/latin-1-supplement/

I tested my example string O’Donnell À Ö ö Ì ÿ 012 on https://regex101.com/ with the pattern [A-Za-z0-9\x{00C0}-\x{00FF}'’\- ]{1,70}, and it matches there, but the same pattern doesn't work in Java. Could you help me write an equivalent pattern for Java?

Sample code I am using to test the regex:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String... args) {
        Pattern p = Pattern.compile("[A-Za-z0-9\\x{00C0}-\\x{00FF}'’\\- ]{1,70}",
                                    Pattern.UNICODE_CHARACTER_CLASS);
        Matcher m = p.matcher("O’Donnell À Ö ö Ì ÿ 012");
        boolean b = m.matches();
        System.out.println("value=" + b);
    }
}

Also relevant: https://stackoverflow.com/questions/10664434/escaping-special-characters-in-java-regular-expressions — Ani, Feb 17 '21 at 17:52

Ani · Answer 1 · 2021-02-23T17:46:20.760

1

Use \\u instead of \\x, remove the curly braces, and keep the backslashes doubled inside the Java string literal, so the pattern becomes:

"[A-Za-z0-9\\u00C0-\\u00FF'’\\- ]{1,70}"
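A minimal sketch of how this pattern could be checked against the string from the question. The class name Latin1RangeDemo and the helper matchesBoth are made up for illustration, and the right single quotation mark is written as an escape (U+2019) so the result does not depend on how the source file is saved. Since Java 7, the java.util.regex.Pattern documentation also lists the \x{h...h} form, so the sketch compiles both spellings to compare them:

import java.util.regex.Pattern;

class Latin1RangeDemo {
    // Range written with backslash-u escapes, as suggested in this answer
    static final Pattern U_FORM =
            Pattern.compile("[A-Za-z0-9\\u00C0-\\u00FF'\\u2019\\- ]{1,70}");
    // Same range written with the x{...} form from the question (Java 7+)
    static final Pattern X_FORM =
            Pattern.compile("[A-Za-z0-9\\x{00C0}-\\x{00FF}'\\x{2019}\\- ]{1,70}");

    // Hypothetical helper: true only if both spellings accept the whole input
    static boolean matchesBoth(String s) {
        return U_FORM.matcher(s).matches() && X_FORM.matcher(s).matches();
    }

    public static void main(String[] args) {
        // Test string from the question, with non-ASCII characters escaped
        String sample = "O\u2019Donnell \u00C0 \u00D6 \u00F6 \u00CC \u00FF 012";
        System.out.println(matchesBoth(sample));
        // U+0100 lies just past the end of the range and should be rejected
        System.out.println(U_FORM.matcher("\u0100").matches());
    }
}

If both spellings agree on your machine, the \x{...} syntax itself is probably not what makes the original code fail.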

edited Feb 23 '21 at 17:46

answered Feb 17 '21 at 16:11

Ani


Which Java version are you testing on? – Bagesh Sharma Feb 17 '21 at 16:19
I am using Java 14 – Ani Feb 17 '21 at 16:33

score 1 · Accepted Answer · answered Feb 18 '21 at 15:41

The answer above works, but it can fail on a Windows machine if the editor saves the source file in an encoding other than UTF-8. Source files that contain Unicode characters should be saved as UTF-8. It is also safer to write special characters in string literals as Unicode escapes, as in the example below.

import java.util.regex.Pattern;

public class Main {

    public static void main(String[] args) {
        // Escaped form of the test string: O’Donnell À Ö ö Ì ÿ 012
        String str = "O\u2019Donnell \u00C0 \u00D6 \u00F6 \u00CC \u00FF 012";
        // The right single quotation mark in the pattern is escaped as well
        System.out.println(Pattern.matches("[A-Za-z0-9\\u00C0-\\u00FF'\\u2019\\- ]{1,70}", str));
    }
}
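
To check whether file encoding is really the cause, one option is to print the code points that ended up in the compiled string. This is only a diagnostic sketch; CodePointDump and describe are hypothetical names, and String.codePoints() requires Java 8 or later.

class CodePointDump {
    // Formats each code point of s as U+XXXX, separated by single spaces
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // Replace this literal with the one from your own source file
        System.out.println(describe("O\u2019D \u00C0"));
        // prints: U+004F U+2019 U+0044 U+0020 U+00C0
    }
}

If a literal typed as ’ shows up as several code points instead of U+2019 (for example U+00E2 U+20AC U+2122 when UTF-8 bytes are read as windows-1252), the source file was most likely compiled with a different encoding than it was saved in.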
