
How can I modify my regex code for string mutations so that it also works for accented letters? For example, a string mutation in regex for "amor" should be the same as the one for "āmōr". I tried simply including the accented letters, as in `(?<=[aeiouāēīōūăĕĭŏŭ])`, but that did not work.

My code:

$hyphenation = '~
(?<=[aeiou]) # each syllable contains a vowel
(?:
    # Muta cum liquida
    ( (?:[bcdfgpt]r | [bcfgp] l | ph [lr] | [cpt] h | qu ) [aeiou] x )
  |
    [bcdfghlmnp-tx]
    (?:
        # ct goes together

        [cp] \K (?=t)
      |
        # two or more consonants are split up
        \K (?= [bcdfghlmnp-tx]+ [aeiou]) 
    )   
  |
    # a consonant and a vowel go together
    (?:
        \K (?= [bcdfghlmnp-t] [aeiou])
      | 
        #  "x" goes to the preceding vowel
        x \K (?= [a-z] | (*SKIP)(*F) ) 
    )
  |
    # two vowels are split up except ae oe...
    \K (?= [aeiou] (?<! ae | oe | au | que | qua | quo | qui ) ) 
)
~xi';


// hyphenation
$result = preg_replace($hyphenation, '-$1', $input);
Flexo
  • You should show more of your code (a complete example to reproduce the problem). I suspect a wrong approach. – Casimir et Hippolyte Nov 06 '16 at 20:47
  • You could first [remove diacritics](http://stackoverflow.com/questions/3635511/remove-diacritics-from-a-string) then try to match what you want. Or you could use `\pL` (any letter from any language) or `\pM` (any combining mark, such as a combining macron) with the Unicode (u) flag. – Nicolas Nov 06 '16 at 20:55
  • @Croutonix I need the diacritics for later. Is it possible to remove the diacritics and add them back later? What do `\pL` and `\pM` do? –  Nov 07 '16 at 14:24
  • @CasimiretHippolyte the full code is in. –  Nov 07 '16 at 14:33
  • @ChrisWinterbottom You could assign your text to a new string, remove the diacritics from that copy, run the regex on it and, for each match, take the substring of the original string using the match's start and end offsets. – Nicolas Nov 07 '16 at 20:48
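
The idea in the last comment (work on a copy without diacritics, then map the result back onto the original) could look roughly like the sketch below. The function name `hyphenateKeepingDiacritics` is hypothetical, and the pattern passed to it is a simplified zero-width pattern, not the full pattern from the question. The sketch assumes the intl extension is loaded and that every accented letter in the input has a precomposed form, so that the stripped copy and the original have the same number of characters.

```php
<?php
// Hypothetical helper: split a word using a pattern written for plain letters,
// but return the pieces taken from the original, accented string.
function hyphenateKeepingDiacritics(string $input, string $pattern): string
{
    // Compose first so that each accented letter is a single code point.
    $composed = Normalizer::normalize($input, Normalizer::FORM_C);

    // Decompose the copy and drop the combining marks to get plain letters.
    $decomposed = Normalizer::normalize($composed, Normalizer::FORM_D);
    $plain = preg_replace('/\pM+/u', '', $decomposed);

    // Split the plain copy at the positions matched by the zero-width pattern.
    $pieces = preg_split($pattern, $plain);

    // Cut the original string into pieces with the same character counts.
    $chars = preg_split('//u', $composed, -1, PREG_SPLIT_NO_EMPTY);
    $out = [];
    $pos = 0;
    foreach ($pieces as $piece) {
        $len = count(preg_split('//u', $piece, -1, PREG_SPLIT_NO_EMPTY));
        $out[] = implode('', array_slice($chars, $pos, $len));
        $pos += $len;
    }
    return implode('-', $out);
}

// Simplified rule (assumption): split between a vowel and consonant + vowel.
$simplePattern = '~(?<=[aeiou])(?=[bcdfghlmnp-t][aeiou])~i';
echo hyphenateKeepingDiacritics("\u{0101}m\u{014D}r", $simplePattern), "\n"; // ā-mōr
```

Because the split positions are counted in characters rather than bytes, the accented original can be cut at the same places as the plain copy.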

1 Answer

An accented letter can be represented in several ways in Unicode. For example, ā can be the single code point U+0101 (LATIN SMALL LETTER A WITH MACRON), but it can also be the combination of U+0061 (LATIN SMALL LETTER A) and U+0304 (COMBINING MACRON).
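
As a small illustration (the variable names are only for this sketch, and the `\u{...}` escapes need PHP 7.0 or later), the two forms look identical on screen but are different byte sequences:

```php
<?php
// Two ways to write "ā".
$precomposed = "\u{0101}";   // U+0101 LATIN SMALL LETTER A WITH MACRON
$combined    = "a\u{0304}";  // U+0061 followed by U+0304 COMBINING MACRON

var_dump($precomposed === $combined);  // bool(false)
echo bin2hex($precomposed), "\n";      // c481
echo bin2hex($combined), "\n";         // 61cc84
```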

Consequently, writing `(?<=[aeiouāēīōūăĕĭŏŭ])` is correct if:

  • you use the u modifier to tell the PCRE regex engine that your string and your pattern must be read as UTF-8 strings. Otherwise, multi-byte characters are seen as separate bytes and not as something atomic. This can be problematic and produce weird results, in particular when multi-byte characters are inside a character class. For example, without the u modifier `[eā]+` will match part of "ē", because "ā" and "ē" share the same first byte in UTF-8.

  • you are sure that the target string and the pattern use the same form for each letter. If the pattern uses U+0101 and the string uses U+0061 followed by U+0304 for "ā", it will not work. To prevent this problem, you can apply `$str = Normalizer::normalize($str);` to the subject string, which converts it to the composed form (NFC) by default. This method comes from the intl extension.
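
A minimal sketch of both points, assuming PHP 7.0 or later and a loaded intl extension (the variable names are made up for the illustration):

```php
<?php
// Character class containing "e" and "ā" (U+0101); the subject is "ē" (U+0113).
$class   = "[e\u{0101}]+";
$subject = "\u{0113}";

// Without u the class is a set of single bytes, so the first byte of "ē" matches.
$withoutU = preg_match("~{$class}~", $subject);
// With u the class holds two whole characters, and "ē" is neither of them.
$withU = preg_match("~{$class}~u", $subject);
var_dump($withoutU, $withU);  // int(1) int(0)

// Normalizing a decomposed "āmōr" gives the precomposed letters back.
$decomposed = "a\u{0304}mo\u{0304}r";
$composed   = Normalizer::normalize($decomposed, Normalizer::FORM_C);
var_dump($composed === "\u{0101}m\u{014D}r");  // bool(true)
```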

You can find more information by following these links:

https://en.wikipedia.org/wiki/Unicode_equivalence
http://utf8-chartable.de/
http://php.net/manual/en/normalizer.normalize.php
http://php.net/manual/en/reference.pcre.pattern.modifiers.php
http://pcre.org/original/pcre.txt

Casimir et Hippolyte
  • How can I use the u modifier? I tried it like this: `$result = preg_replace($hyphenation, '-$1',$input."u");` –  Nov 21 '16 at 14:01
  • @ChrisWinterbottom: no, you need to put it with the other modifiers x and i at the end of the pattern, after the delimiter. – Casimir et Hippolyte Nov 21 '16 at 14:07
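
To make the placement concrete, here is a reduced sketch. The pattern below is a simplified stand-in, not the full hyphenation pattern from the question, and `$vowels` is a name chosen for the illustration. The u sits next to x and i after the closing `~` delimiter:

```php
<?php
// Plain vowels plus the macron and breve vowels, written as code point escapes.
$vowels = "aeiou"
        . "\u{0101}\u{0113}\u{012B}\u{014D}\u{016B}"   // ā ē ī ō ū
        . "\u{0103}\u{0115}\u{012D}\u{014F}\u{016D}";  // ă ĕ ĭ ŏ ŭ

// Simplified rule (assumption): split between a vowel and consonant + vowel.
// Note the modifiers "xiu" after the closing delimiter.
$pattern = "~ (?<=[{$vowels}]) (?= [bcdfghlmnp-t] [{$vowels}] ) ~xiu";

$plainResult    = preg_replace($pattern, '-', "amor");
$accentedResult = preg_replace($pattern, '-', "\u{0101}m\u{014D}r");

echo $plainResult, "\n";     // a-mor
echo $accentedResult, "\n";  // ā-mōr
```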