usage of accented letters in regex in string mutations

Question

How can I modify my regex for string mutations so that it also works with accented letters? For example, a string mutation in regex for "amor" should be the same as the one for "āmōr". I tried simply including the accented letters, as in `(?<=[aeiouāēīōūăĕĭŏŭ])`, but that did not work.

My code:

$hyphenation = '~
(?<=[aeiou]) # each syllable contains a vowel
(?:
    # Muta cum liquida
    ( (?:[bcdfgpt]r | [bcfgp] l | ph [lr] | [cpt] h | qu ) [aeiou] x )
  |
    [bcdfghlmnp-tx]
    (?:
        # ct goes together

        [cp] \K (?=t)
      |
        # two or more consonants are split up
        \K (?= [bcdfghlmnp-tx]+ [aeiou]) 
    )   
  |
    # a consonant and a vowel go together
    (?:
        \K (?= [bcdfghlmnp-t] [aeiou])
      | 
        #  "x" goes to the preceding vowel
        x \K (?= [a-z] | (*SKIP)(*F) ) 
    )
  |
    # two vowels are split up except ae oe...
    \K (?= [aeiou] (?<! ae | oe | au | que | qua | quo | qui ) ) 
)
~xi';


// hyphenation
$result = preg_replace($hyphenation, '-$1', $input);

You should show more of your code (a complete example to reproduce the problem). I suspect a wrong approach. — Casimir et Hippolyte, Nov 06 '16 at 20:47
You could first [remove diacritics](http://stackoverflow.com/questions/3635511/remove-diacritics-from-a-string) and then try to match what you want. Or you could use `\pL` (any letter from any language) or `\pM` (any combining mark, such as a combining accent) with the Unicode (`u`) flag. — Nicolas, Nov 06 '16 at 20:55
@Croutonix I need the diacritics for later. Is it possible to remove the diacritics and put them back afterwards? What do `\pL` and `\pM` do? — , Nov 07 '16 at 14:24
@ChrisWinterbottom You could copy your text to a new string, remove the diacritics from the copy, run the regex on it, and for each match take the substring of the original string between the match's start and end. — Nicolas, Nov 07 '16 at 20:48
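
The approach from the last comment could be sketched roughly as follows. This is only an illustration under assumptions: the `$plainVowel` table is a hypothetical helper, the one-line pattern is a simplified stand-in for the full hyphenation pattern, and the input is assumed to contain only ASCII letters plus the precomposed vowels listed in the table, so that each character of the stripped copy corresponds to exactly one character of the original.

```php
<?php
// Hypothetical lookup table: the precomposed vowels from the question
// mapped to their unaccented ASCII counterparts.
$plainVowel = [
    "\u{0101}" => 'a', "\u{0113}" => 'e', "\u{012B}" => 'i',
    "\u{014D}" => 'o', "\u{016B}" => 'u',                     // ā ē ī ō ū
    "\u{0103}" => 'a', "\u{0115}" => 'e', "\u{012D}" => 'i',
    "\u{014F}" => 'o', "\u{016D}" => 'u',                     // ă ĕ ĭ ŏ ŭ
];

$original = "\u{0101}m\u{014D}r";          // "āmōr"
$plain    = strtr($original, $plainVowel); // "amor", one byte per character

// Simplified stand-in rule: break after a vowel, before consonant + vowel.
preg_match_all(
    '~(?<=[aeiou])(?=[bcdfghlmnp-t][aeiou])~i',
    $plain,
    $m,
    PREG_OFFSET_CAPTURE
);
// Byte offsets in the ASCII copy equal character indexes in the original.
$breaks = array_column($m[0], 1);

// Split the original into characters and re-insert the hyphens.
$chars  = preg_split('//u', $original, -1, PREG_SPLIT_NO_EMPTY);
$result = '';
foreach ($chars as $i => $char) {
    if (in_array($i, $breaks, true)) {
        $result .= '-';
    }
    $result .= $char;
}

echo $result, "\n"; // ā-mōr
```

The diacritics are never removed from `$original`, so nothing has to be pasted back; the stripped copy is used only to find the break positions.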

score 0 · Accepted Answer · answered Nov 07 '16 at 20:34

An accented letter can be represented in several ways in Unicode. For example, ā can be the single code point U+0101 (LATIN SMALL LETTER A WITH MACRON), but it can also be the combination of U+0061 (LATIN SMALL LETTER A) and U+0304 (COMBINING MACRON).
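
As a small illustration (assuming PHP 7 or later for the `\u{...}` string escapes), the two forms look identical on screen but are different byte sequences:

```php
<?php
// Two encodings of the same visible letter "ā".
$precomposed = "\u{0101}";  // U+0101 LATIN SMALL LETTER A WITH MACRON
$decomposed  = "a\u{0304}"; // U+0061 followed by U+0304 COMBINING MACRON

var_dump($precomposed === $decomposed); // bool(false)
var_dump(bin2hex($precomposed));        // string(4) "c481"
var_dump(bin2hex($decomposed));         // string(6) "61cc84"
```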

Consequently, writing (?<=[aeiouāēīōūăĕĭŏŭ]) is correct if:

you use the u modifier to tell the PCRE regex engine that both your string and your pattern must be read as UTF-8. Otherwise multibyte characters are seen as separate bytes and not as atomic units. This can produce weird results, in particular when multibyte characters appear inside a character class: for example, [eā]+ will match the first byte of "ē", because ā and ē share that byte in UTF-8.
you are sure that the target string and the pattern use the same form for each letter. If the pattern uses U+0101 and the string uses U+0061 followed by U+0304 for "ā", it will not work. To prevent this problem, you can apply $str = Normalizer::normalize($str); to the subject string, which by default converts it to the composed form (NFC). This method comes from the intl extension.
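
Both conditions can be checked with a short sketch. It assumes the intl extension is loaded (for `Normalizer`) and PHP 7 or later; the variable `$vowels` and the one-letter test pattern are made up for the illustration, and the accented vowels are written as escapes so that their precomposed form is explicit.

```php
<?php
// Condition 1: without "u" the class is a set of bytes, not of letters.
$class = "~[e\u{0101}]+~";              // [eā]+ without the u modifier
preg_match($class, "\u{0113}", $m);     // subject is "ē" (bytes c4 93)
var_dump(bin2hex($m[0]));               // string(2) "c4": half a character
var_dump(preg_match($class . 'u', "\u{0113}")); // int(0): no match with u

// Condition 2: pattern and subject must use the same form.
// Precomposed ā ē ī ō ū in the class (illustrative subset of the vowels).
$vowels  = "aeiou\u{0101}\u{0113}\u{012B}\u{014D}\u{016B}";
$pattern = "~(?<=[{$vowels}])m~u";

$decomposed = "a\u{0304}mor";           // "āmor" as a + COMBINING MACRON
var_dump(preg_match($pattern, $decomposed)); // int(0): U+0304 precedes m

// NFC is the default form: a + U+0304 is composed into U+0101.
$normalized = Normalizer::normalize($decomposed);
var_dump(preg_match($pattern, $normalized)); // int(1)
```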

You can find more information at these links:

https://en.wikipedia.org/wiki/Unicode_equivalence
http://utf8-chartable.de/
http://php.net/manual/en/normalizer.normalize.php
http://php.net/manual/en/reference.pcre.pattern.modifiers.php
http://pcre.org/original/pcre.txt

How can I use the u modifier? I tried it like this: `$result = preg_replace($hyphenation, '-$1',$input."u");` — , Nov 21 '16 at 14:01
@ChrisWinterbottom: no, you need to put it with the other modifiers x and i at the end of the pattern, after the delimiter. — Casimir et Hippolyte, Nov 21 '16 at 14:07
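
A minimal sketch of the placement, using a cut-down stand-in for the full pattern (only one rule, with the vowel class extended by precomposed ā ē ī ō ū; the `$vowels` variable is an assumption made for the example):

```php
<?php
// Wrong: this appends the letter "u" to the subject string.
// $result = preg_replace($hyphenation, '-$1', $input . "u");

// Right: "u" goes after the closing delimiter, next to x and i.
$vowels      = "aeiou\u{0101}\u{0113}\u{012B}\u{014D}\u{016B}";
$hyphenation = "~
    (?<=[{$vowels}])                  # a vowel before the break
    (?= [bcdfghlmnp-t] [{$vowels}] )  # consonant + vowel after it
~xiu";

echo preg_replace($hyphenation, '-', 'amor'), "\n";               // a-mor
echo preg_replace($hyphenation, '-', "\u{0101}m\u{014D}r"), "\n"; // ā-mōr
```

With the modifier in place, the plain and the accented word are split at the same position.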
