
Is there any way in a regex to specify a match for a character with a specific diacritic? Let's say a grave accent for example. The long way to do this is to go to the Wikipedia page on the grave accent, copy all of the characters it shows, then make a character class out of them:

/[àầằèềḕìǹòồṑùǜừẁỳ]/i

That's quite tedious. I was hoping for a Unicode property like \p{hasGraveAccent}, but I can't find anything like that. Searching for a solution only comes up with questions from people trying to match characters while ignoring diacritics, which involves performing a normalization of some kind, which is not what I want.

Nate Glenn
  • If it's a combining character, that might be possible by [generating a list of unicode codepoints](http://stackoverflow.com/questions/17051732/algorithm-to-check-for-combining-characters-in-unicode). – kba Feb 13 '16 at 03:27
  • Making a character class out of single letters is not reliable. It only works for precomposed letters, i.e. when both the pattern and the target string are in NFC (Normalization Form Composed). Most characters with two or more diacritics have no precomposed form, so they consist of more than one code point (a "character" in Unicode terms). If you copy and paste them into a character class, each diacritic is still a separate character and will match that same diacritic anywhere in the target string. – Helmut Wollmersdorfer Feb 20 '16 at 12:43

2 Answers

1

It's possible with some limitations.

#!perl

use strict;
use warnings;

use Encode;
use Unicode::Normalize;
use charnames qw();
use utf8;  # source is utf-8

binmode(STDOUT, ":utf8"); # print in utf-8

my $utf8_string = 'xàaâèaêòͤ';

my $nfd_string = NFD($utf8_string); # decompose

my @chars_with_grave = $nfd_string =~
  m/
    (
      \p{L}           # one letter
      \p{M}*          # 0 or more marks
      \N{COMBINING GRAVE ACCENT}
      \p{M}*          # 0 or more marks
    )
  /xmsg;

print join(', ',@chars_with_grave), "\n";

This prints

$ perl utf_match_grave.pl 
à, è, òͤ

NOTE: The characters are displayed correctly as combined in the edit area, but Stack Overflow renders them wrongly separated.

The pattern requires a letter as the base character; change \p{L} if you need other base characters. \p{M} matches any kind of mark, which may be broader than what you want, so that part could be refined.
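If the letter-only restriction is a problem, one possible variation is to split the decomposed string into extended grapheme clusters with \X and keep those that contain U+0300. This is only a sketch under that assumption, not a replacement for the code above; the helper name clusters_with_grave is made up for this example.

```perl
use strict;
use warnings;
use utf8;  # source is utf-8
use Unicode::Normalize qw(NFC NFD);

# Hypothetical helper: returns every grapheme cluster of $text (in NFD)
# that contains a COMBINING GRAVE ACCENT (U+0300), whatever the base is.
sub clusters_with_grave {
    my ($text) = @_;
    my $nfd = NFD($text);                          # decompose first
    return grep { /\x{0300}/ } $nfd =~ /(\X)/g;    # \X = one grapheme cluster
}

binmode(STDOUT, ':encoding(UTF-8)');

# Recompose each cluster for display only.
print join(', ', map { NFC($_) } clusters_with_grave('xàaâèaêòͤ')), "\n";
```

Because \X does not care what the base character is, a digit or symbol followed by a grave accent is reported as well, which may or may not be what you want.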

0

It's a bit of a tricky question, but it is possible. First, you have to normalize the Unicode string into one of the four normalization forms (NFC, NFD, NFKC, NFKD). Information on normalization is here, a map of character examples with the different normalizations is here, and a good chart for the normalized characters is here. Essentially, normalizing makes sure all characters with diacritics are represented the same way. Golang has great support for this, and most languages should have libraries to do this.
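As a small illustration of what normalization does to a single accented letter, this Perl sketch prints the code points of à in composed and decomposed form. The helper name code_points is made up for this example.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

# Hypothetical helper: format a string as a list of U+XXXX code points.
sub code_points {
    my ($string) = @_;
    return join ' ', map { sprintf 'U+%04X', ord } split //, $string;
}

my $precomposed = "\x{00E0}";   # LATIN SMALL LETTER A WITH GRAVE

print 'NFC: ', code_points(NFC($precomposed)), "\n";   # NFC: U+00E0
print 'NFD: ', code_points(NFD($precomposed)), "\n";   # NFD: U+0061 U+0300
```

In the decomposed form the accent is its own code point, which is what makes it searchable.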

So for my example, convert your string to "Normalization Form D" (NFD) and then to UTF-32, so that every code point takes up exactly 4 bytes.

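After NFD, a grave accent is always the separate code point U+0300 (COMBINING GRAVE ACCENT) following its base character, although other combining marks can sit between the two. So you can do a regular expression search in byte mode (NOT Unicode mode) for ....\x00\x00\x03\x00, which is any 4 bytes followed by the big-endian UTF-32 encoding of U+0300. From there you have to work out which character position the match is at. That can be done with different methods depending on which encoding you are using.

With UTF-32, a match whose byte offset is a multiple of 4 starts on a code point boundary, so you know it is a valid character.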

Besides that, Perl has no built-in character class or Unicode property for this.

Perl code as an example:

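use strict;
use warnings;
use Encode;
use Unicode::Normalize;

# No "use utf8" here: the literal is raw UTF-8 bytes, decoded explicitly below.
my $StartUTF8 = 'xàaâèaê';
my $PerlEncoded = decode('utf8', $StartUTF8);
my $PerlNormalized = NFD($PerlEncoded);
# Name the byte order explicitly: no BOM, and it matches the pattern below.
my $UTF32Normalized = encode('UTF-32BE', $PerlNormalized);

# Each match is 8 bytes: one 4-byte code point followed by U+0300.
while ($UTF32Normalized =~ /(....\x00\x00\x03\x00)/gs) {
    my $Pos = pos($UTF32Normalized) - 8;   # byte offset where the match starts
    if ($Pos % 4 == 0) {                   # aligned on a code point boundary
        print("$Pos\n");
    }
}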

But at this point, you might as well just be doing a for loop over the characters :-\

I also tried matching without needing the position test using //c, but for some reason it wouldn't work.

/^(?:....)*?(....\x00\x00\x03\x00)/gcs
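One likely reason the pattern above fails is that ^ only matches at the very start of the string, so after the first hit every later /g iteration has nowhere to anchor. Anchoring with \G (the position where the previous match ended) keeps each attempt on a 4-byte boundary instead. The following is a sketch of that idea, assuming big-endian UTF-32 without a BOM; the helper name grave_base_indexes is made up, and it still only finds a grave accent that immediately follows the base character.

```perl
use strict;
use warnings;
use Encode qw(encode);
use Unicode::Normalize qw(NFD);

# Hypothetical helper: returns the character indexes (within the NFD form
# of $text) of code points that are directly followed by U+0300.
sub grave_base_indexes {
    my ($text) = @_;
    my $utf32 = encode('UTF-32BE', NFD($text));   # 4 bytes per code point, no BOM
    my @indexes;

    # \G resumes where the last match ended, and (?:....)*? skips whole
    # 4-byte code points, so the capture always starts on a boundary.
    while ($utf32 =~ /\G(?:....)*?(....\x00\x00\x03\x00)/gs) {
        push @indexes, $-[1] / 4;                 # byte offset -> character index
    }
    return @indexes;
}

# x à a â è a ê, written with escapes so the file encoding does not matter
print join(', ', grave_base_indexes("x\x{E0}a\x{E2}\x{E8}a\x{EA}")), "\n";   # 1, 5
```

The indexes refer to the decomposed string, not the original one, so they have to be mapped back if you need positions in the input.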

Dakusan
  • There is no point converting to UTF32 (and, if you're going to assume the result is UTF-32LE, you should convert to that rather than leaving it to chance). Also, the assumption that the grave accent immediately follows the base character may be incorrect when the glyph contains more than one diacritic. – rici Feb 13 '16 at 19:38
  • Indeed. It was pretty fruitless research and testing – Dakusan Feb 14 '16 at 02:31