It's a bit of a tricky question, but it is possible. First, you have to normalize the Unicode string into one of the four normalization forms (NFC, NFD, NFKC, NFKD). Information on normalization is here, a map of character examples with the different normalizations is here, and a good chart for the normalized characters is here. Essentially, normalizing makes sure that every character with a diacritic is represented the same way, either always precomposed or always decomposed. Golang has great support for this, and most languages have a library for it.
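To make the effect of normalization concrete, here is a minimal sketch that decomposes a single precomposed character with Unicode::Normalize. The variable names are only illustrative.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

# "\x{E0}" is the precomposed character à (U+00E0).
my $composed   = "\x{E0}";
my $decomposed = NFD($composed);

# Format each code point of the decomposed string as U+XXXX.
my @code_points = map { sprintf('U+%04X', ord($_)) } split //, $decomposed;
print "@code_points\n";    # U+0061 U+0300

# NFC recombines the pair into the single precomposed character.
print length(NFC($decomposed)), "\n";    # 1
```

So in NFD the accent is no longer part of the letter; it is a separate code point that follows it.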
So for my example, convert your string to "Normalization Form D" (NFD) and then encode it as UTF-32, so that every Unicode character is stored as its code point in exactly 4 bytes.
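As a rough illustration of what that byte layout looks like, the sketch below encodes the decomposed form of à and dumps it as hex. It assumes the big-endian variant without a byte order mark (`UTF-32BE`); a plain `UTF-32` encoder may put a 4-byte BOM in front.

```perl
use strict;
use warnings;
use Encode qw(encode);
use Unicode::Normalize qw(NFD);

# à (U+00E0) decomposes to "a" (U+0061) followed by U+0300.
my $bytes = encode('UTF-32BE', NFD("\x{E0}"));

print length($bytes), "\n";          # 8
print unpack('H*', $bytes), "\n";    # 0000006100000300
```

The last four bytes, `00 00 03 00`, are the combining grave accent.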
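In NFD, every character with a grave accent becomes its base character followed immediately by the combining grave accent, U+0300. So you can do a regular expression search in byte (ASCII) mode, NOT Unicode mode, for

    ....\x00\x00\x03\x00

that is, any 4 bytes (the base character) followed by the big-endian UTF-32 bytes of U+0300. From there you have to work out which rune the match belongs to, and how you do that depends on which encoding you are using.
With UTF-32 it is simple: if the byte offset of the match is divisible by 4, you know it starts on a real character boundary. The offset divided by 4 is then the rune index (minus one if the encoder prepended a byte order mark).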
Besides that, as far as I know Perl has no built-in character class that matches "a letter with a grave accent" directly.
Perl code as an example:
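    use strict;
    use warnings;
    use Encode;
    use Unicode::Normalize;

    # No "use utf8" here, so this literal holds raw UTF-8 bytes
    # (assuming the source file is saved as UTF-8).
    my $StartUTF8 = 'xàaâèaê';
    my $PerlEncoded = decode('UTF-8', $StartUTF8);    # bytes -> characters
    my $PerlNormalized = NFD($PerlEncoded);           # à -> a + U+0300
    # UTF-32BE: 4 bytes per code point, big-endian, no byte order mark.
    my $UTF32Normalized = encode('UTF-32BE', $PerlNormalized);

    # Any 4 bytes (the base character) followed by U+0300.
    # /s lets "." also match a newline byte.
    while ($UTF32Normalized =~ /(....\x00\x00\x03\x00)/gs) {
        my $Pos = pos($UTF32Normalized) - 8;    # byte offset of the base character
        if ($Pos % 4 == 0) {                    # skip matches that straddle two code points
            print("$Pos\n");                    # $Pos / 4 is the code point index
        }
    }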
But at this point, you might as well just be doing a for loop over the characters :-\
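For comparison, a minimal sketch of that plain loop. It works on the decoded, NFD-normalized character string rather than on UTF-32 bytes, and the sample text is written with escapes so the snippet does not depend on the source file's encoding. Note that the indices refer to the NFD string, not to the original one.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Sample text: x à a â è a ê
my $text  = NFD("x\x{E0}a\x{E2}\x{E8}a\x{EA}");
my @chars = split //, $text;

# Collect the index of every character directly followed by U+0300.
my @hits;
for my $i (0 .. $#chars - 1) {
    push @hits, $i if $chars[$i + 1] eq "\x{300}";
}
print "@hits\n";    # 1 5
```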
I also tried to get rid of the position test by anchoring the pattern and using the /c modifier, but for some reason it wouldn't work:

    /^(?:....)*?(....\x00\x00\x03\x00)/gcs
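One probable reason the anchored attempt stops early: `^` only matches at the very start of the string, so once `/g` resumes from a later `pos`, the pattern can no longer match. A sketch that anchors with `\G` (the position where the previous match ended) instead, assuming `UTF-32BE` input without a byte order mark; the variable names are illustrative:

```perl
use strict;
use warnings;
use Encode qw(encode);
use Unicode::Normalize qw(NFD);

# Sample text: x à a â è a ê, normalized and encoded as 4 bytes per code point.
my $buf = encode('UTF-32BE', NFD("x\x{E0}a\x{E2}\x{E8}a\x{EA}"));

my @indexes;
# \G pins each attempt to the end of the previous match (initially offset 0),
# and (?:....)*? only skips whole 4-byte units, so every match stays aligned.
while ($buf =~ /\G(?:....)*?(....\x00\x00\x03\x00)/gs) {
    push @indexes, $-[1] / 4;    # start of capture group 1, in code points
}
print "@indexes\n";    # 1 5
```

Because every match ends on a 4-byte boundary, the next search also starts on one and no modulo test is needed.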