4

I'd like to write a regular expression which will match all accented forms of a particular character in text encoded using some Unicode encoding, without explicitly listing out all such forms in a character class.

So, for example, if I'd like to match any accented version of a, [aàáâãäå] is insufficient: it covers only the accented a's that exist in ISO-8859-1, and there may well be other accented forms outside that set. Something like \p{Base_Character: a} would be acceptable, were such a property defined in Unicode. Does anything like this exist?

Edit: I can't ASCIIfy the string first, because the string is in a database I don't have direct access to. In fact, I don't have code-level access to anything here. The only input I can give is a regex.

uckelman

2 Answers

0

No, I know of no library that does anything other than list the related code points for the accented versions. Even within UTF-8, I do not see any discernible pattern among those code points. Honestly, though, making that list of accented versions by hand wouldn't take too long.
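
If some machine with a JDK is available, the list does not have to be typed by hand. The following is only a sketch of one way to generate it, not something the answer above describes: it assumes that "accented version of a" can be approximated as "any character whose canonical decomposition (NFD) starts with a". The class name `AccentedForms` and the method `characterClassFor` are made up for illustration, and the exact set produced depends on the Unicode version shipped with the JDK.

```java
import java.text.Normalizer;

class AccentedForms {
    // Returns a regex character class holding every BMP character whose
    // canonical decomposition (NFD) begins with the given base letter.
    // Assumption: `base` is a plain letter, so nothing inside the class
    // needs escaping.
    static String characterClassFor(char base) {
        StringBuilder sb = new StringBuilder("[");
        for (int cp = 0; cp <= 0xFFFF; cp++) {
            char c = (char) cp;
            // Skip lone surrogates and unassigned code points.
            if (Character.isSurrogate(c) || !Character.isDefined(c)) continue;
            String nfd = Normalizer.normalize(String.valueOf(c), Normalizer.Form.NFD);
            if (nfd.charAt(0) == base) sb.append(c);
        }
        return sb.append(']').toString();
    }

    public static void main(String[] args) {
        // Prints a character class that can be pasted into a regex.
        System.out.println(characterClassFor('a'));
    }
}
```

The printed class covers precomposed characters only. Text stored in decomposed form (a plain a followed by a combining mark) would still match on the bare a, but the mark itself would not be part of the match.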

mvrak
0

I don't think you can do that. A workaround that could help, depending on your language/platform and needs, is to "ascii-fy" your string before matching the a. For example, in Java:

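    import java.text.Normalizer;

    // Decompose to NFD, then drop every non-ASCII char (the combining marks included).
    String s1 = "Hernán";
    String s2 = Normalizer.normalize(s1, Normalizer.Form.NFD).replaceAll("[^\\p{ASCII}]", "");
    System.out.println(s2);                  // prints "Hernan"
    System.out.println(s2.matches(".*a.*")); // prints "true"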
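
A variation on the same idea, sketched here under the same assumption that the text can be normalized in code before matching: instead of stripping the accents, keep the NFD form and match the base letter followed by any combining marks. The class name `BaseLetterMatch` and the method `findForms` are hypothetical, and the pattern handles lowercase a only.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BaseLetterMatch {
    // 'a' followed by zero or more combining marks (general category M).
    private static final Pattern A_WITH_MARKS = Pattern.compile("a\\p{M}*");

    // Returns each plain or accented 'a' found in the text, recomposed to NFC.
    static List<String> findForms(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        List<String> hits = new ArrayList<>();
        Matcher m = A_WITH_MARKS.matcher(decomposed);
        while (m.find()) {
            hits.add(Normalizer.normalize(m.group(), Normalizer.Form.NFC));
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(findForms("Hern\u00E1n"));
    }
}
```

Unlike the stripping approach, this keeps the original accents available in the match, which helps when the matched text has to be reported or replaced rather than merely detected.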
leonbloy