How to match all accented forms of a particular character?

Question

I'd like to write a regular expression which will match all accented forms of a particular character in text encoded using some Unicode encoding, without explicitly listing out all such forms in a character class.

So, for example, if I'd like to match any accented version of a, [aàáâãäå] is insufficient, as it gets only the a's which live in ISO-8859-1, and there may well be other accents which don't occur there. Something which would be acceptable is something like \p{Base_Character: a}, were there such a thing defined in Unicode. Does something which does this exist?

Edit: I can't ASCIIfy the string first---the string is in a database I don't have direct access to. I don't have code-level access to anything here, in fact. The only input I can give is a regex.

score 0 · Answer 1 · answered Jan 23 '12 at 18:36

No. As far as I know, no libraries do anything other than list the code points of the accented versions. Even within UTF-8, I do not see any discernible pattern among those code points. Honestly though, building that list of accented versions yourself wouldn't take too long.
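If you do want to generate that list rather than type it by hand, here is a rough sketch in Java. The class and method names (`AccentedForms`, `characterClassFor`) are made up for illustration, and it assumes that "accented form" means "a code point in the Basic Multilingual Plane whose canonical decomposition (NFD) is the base letter followed only by combining marks". Treat it as one possible way to build the character class, not a definitive answer:

```java
import java.text.Normalizer;

public class AccentedForms {
    // Hypothetical helper: builds a regex character class containing the base
    // letter plus every BMP code point whose NFD form is that base letter
    // followed only by combining marks (Unicode general category M).
    // Assumes the base letter is not a regex metacharacter inside [...].
    static String characterClassFor(char base) {
        StringBuilder sb = new StringBuilder("[").append(base);
        for (int cp = 0x80; cp <= 0xFFFF; cp++) {
            // Skip surrogate halves and unassigned code points.
            if (Character.isSurrogate((char) cp) || !Character.isDefined(cp)) {
                continue;
            }
            String nfd = Normalizer.normalize(String.valueOf((char) cp), Normalizer.Form.NFD);
            if (nfd.length() > 1
                    && nfd.charAt(0) == base
                    && nfd.substring(1).matches("\\p{M}+")) {
                sb.append((char) cp);
            }
        }
        return sb.append(']').toString();
    }

    public static void main(String[] args) {
        String cls = characterClassFor('a');
        System.out.println(cls);                    // the generated class, e.g. [aàáâ...]
        System.out.println("\u0101".matches(cls));  // a with macron
    }
}
```

The resulting string can be pasted into a regex as-is. Note that it only covers precomposed characters; text stored in decomposed form (a plain a followed by a combining accent) would need something like `a\p{M}*` instead, if the regex engine supports Unicode properties.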

score 0 · Answer 2 · edited May 23 '17 at 12:27

I don't think you can do that. A workaround that could help, depending on your language/platform and needs, is to "ascii-fy" your string before matching the a. For example, in Java:

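    import java.text.Normalizer;

    String s1 = "Hernán";
    // NFD splits "á" into "a" + U+0301; replaceAll then drops every non-ASCII char.
    String s2 = Normalizer.normalize(s1, Normalizer.Form.NFD).replaceAll("[^\\p{ASCII}]", "");
    System.out.println(s2);                  // Hernan
    System.out.println(s2.matches(".*a.*")); // true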
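A variant of the same idea, in case you want to keep the accents in the text instead of stripping them: normalize to NFD and match the base letter followed by any number of combining marks. This is only a sketch; the class and method names (`BasePlusMarks`, `countAForms`) are invented for the example, and it still assumes you can normalize the string in code, which the asker says is not possible in their setup:

```java
import java.text.Normalizer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BasePlusMarks {
    // 'a' followed by zero or more combining marks (Unicode general category M).
    // This also matches a plain, unaccented 'a'.
    static final Pattern A_FORMS = Pattern.compile("a\\p{M}*");

    // Hypothetical helper: counts every plain or accented 'a' in the text.
    static int countAForms(String text) {
        // Decompose precomposed letters so "á" becomes "a" + U+0301.
        String nfd = Normalizer.normalize(text, Normalizer.Form.NFD);
        Matcher m = A_FORMS.matcher(nfd);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countAForms("Hern\u00E1n")); // 1
    }
}
```

Because the pattern consumes the combining marks along with the base letter, a letter carrying several accents still counts as a single match.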
