8

I'm trying to match Unicode characters in Java.

Input String: informa

String to match: informátion

So far I ve tried this:

Pattern p = Pattern.compile("informa[\u0000-\uffff].*", Pattern.UNICODE_CASE | Pattern.CANON_EQ | Pattern.CASE_INSENSITIVE);
String s = "informátion";
Matcher m = p.matcher(s);
if (m.matches()) {
    System.out.println("Match!");
} else {
    System.out.println("No match");
}

It comes out as "No match". Any ideas?

kennytm
ankimal

3 Answers

12

The term "Unicode characters" is not specific enough: it covers every character in the Unicode range, including "normal" ASCII characters. The term is, however, very often used when one actually means "characters which are not in the printable ASCII range".

In regex terms that would be [^\x20-\x7E].

boolean containsNonPrintableASCIIChars = string.matches(".*[^\\x20-\\x7E].*");
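As a rough sketch of the same check in a reusable form: the class and method names below are made up for illustration. It uses a compiled Pattern with find() instead of String.matches(), which is my own variation and not something the one-liner above requires. One reason to consider it is that "." does not match line terminators by default, so the ".*" wrapper can report false for multi-line input.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative only.
final class NonPrintableAsciiCheck {
    // Matches any single character outside the printable ASCII range U+0020..U+007E.
    private static final Pattern OUTSIDE_PRINTABLE_ASCII = Pattern.compile("[^\\x20-\\x7E]");

    // find() scans the whole input, so no leading or trailing ".*" is needed.
    static boolean containsNonPrintableAscii(String text) {
        return OUTSIDE_PRINTABLE_ASCII.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(containsNonPrintableAscii("information"));      // plain ASCII
        System.out.println(containsNonPrintableAscii("inform\u00e1tion")); // contains U+00E1
        System.out.println(containsNonPrintableAscii("line1\nline2\n"));   // contains line feeds
    }
}
```

This should print false, true, true. Note that line feeds and tabs count as "non-printable" here, which may or may not be what you want.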

Depending on what you'd like to do with this information, here are some useful follow-up answers:

BalusC
  • The java.text.Normalizer seems like the way to go (bullet 2). Unicode matching just doesn't seem to work as expected, and I might be hit with a performance penalty even if it did. – ankimal Jun 23 '10 at 22:49
  • If your *actual* functional requirement is "get rid of diacritical marks", then it's indeed the way to go. Your initial question just wasn't formulated like that :) – BalusC Jun 23 '10 at 22:58
  • I think the question wasn't crystal clear. The goal was to be able to match "information" with "informátion", i.e. to match 'a' with any form of a, like 'á', 'å', etc. Removing diacritical marks and then matching seems to be the way to go. – ankimal Jun 23 '10 at 23:16
6

Is it because informa isn't a substring of informátion at all?

How would your code work if you removed the last a from informa in your regex?
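To make that concrete, here is a small sketch. It assumes the string in the question contains the precomposed letter á (U+00E1), which I can't verify from the post. The class and method names are mine, and I've left out the flags from the question to keep the comparison simple.

```java
import java.text.Normalizer;

// Hypothetical demo class; the names are illustrative only.
final class PrecomposedVersusDecomposed {
    // Decomposes precomposed letters into base letter plus combining mark (NFD).
    static String decompose(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        // Assumption: the input uses the single precomposed code point U+00E1.
        String precomposed = "inform\u00e1tion";
        String decomposed = decompose(precomposed);

        System.out.println(precomposed.length());              // 11 chars
        System.out.println(decomposed.length());               // 12 chars: 'a' followed by U+0301
        System.out.println(precomposed.startsWith("informa")); // false: the 7th char is U+00E1, not 'a'
        System.out.println(decomposed.startsWith("informa"));  // true
        System.out.println(precomposed.matches("inform.*"));   // true once the trailing 'a' is dropped
    }
}
```

So in the precomposed form there is no plain 'a' after "inform" for the literal part of your regex to match.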

Austin Fitzpatrick
  • informa\u0301 works in the pattern string. This has to do with the Pattern.CANON_EQ case. – ankimal Jun 23 '10 at 18:54
  • Forgot to put in the link for this, http://java.sun.com/docs/books/tutorial/essential/regex/pattern.html (Pattern.CANON_EQ) – ankimal Jun 23 '10 at 23:09
1

It sounds like you want to match letters while ignoring diacritical marks. If that's right, then normalize your strings to NFD form, strip out the diacritical marks, and then do your search.

String normalized = java.text.Normalizer.normalize(textToSearch, java.text.Normalizer.Form.NFD);
String withoutDiacritical = normalized.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
// Search code goes here...
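For what it's worth, here is one way the search step could look for the example in the question. This is only a sketch: the class and method names are hypothetical, and the prefix match and the lower-casing with Locale.ROOT are my assumptions about what "match" should mean here.

```java
import java.text.Normalizer;
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative only.
final class DiacriticInsensitiveMatch {
    // Combining marks in the block U+0300..U+036F, left behind by NFD decomposition.
    private static final Pattern MARKS = Pattern.compile("\\p{InCombiningDiacriticalMarks}+");

    // Decomposes to NFD, then removes the combining marks.
    static String stripDiacritics(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return MARKS.matcher(decomposed).replaceAll("");
    }

    // Assumed matching rule: case-insensitive prefix match after stripping marks.
    static boolean startsWithIgnoringDiacritics(String text, String prefix) {
        String haystack = stripDiacritics(text).toLowerCase(Locale.ROOT);
        String needle = stripDiacritics(prefix).toLowerCase(Locale.ROOT);
        return haystack.startsWith(needle);
    }

    public static void main(String[] args) {
        System.out.println(stripDiacritics("inform\u00e1tion"));
        System.out.println(startsWithIgnoringDiacritics("inform\u00e1tion", "informa"));
    }
}
```

This should print "information" and then "true". Keep in mind that it only strips marks that decompose into that one Unicode block; letters such as 'ø' have no decomposition and pass through unchanged.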

To learn more about NFD:

james.garriss