Running this snippet of PHP code:

preg_match("/^sito in (.*) \(([A-Z]{2})\)(.*)( CAP )?([0-9]{5})?$/U", "sito in Paternò (CT) Contrada Palazzolo, 28 CAP 95047", $matches);
var_dump(trim($matches[1]));

leads to this result:

string(8) "Paternò�"

(yes, there is a garbage character after the accented letter)

instead of the expected:

string(7) "Paternò"

How can I correctly extract words containing accented letters using preg_match?

Marco Marsala
  • possible duplicate of [How to allow utf-8 charset in preg\_match?](http://stackoverflow.com/questions/2934135/how-to-allow-utf-8-charset-in-preg-match) – Blackbam Sep 29 '15 at 14:25
  • Are you sure it's not `trim()` or `var_dump()` that causes the issue? Your regex looks correct to me. You could add the `u` modifier to treat everything as UTF-8, though... – tchap Sep 29 '15 at 14:26
  • @Blackbam @tchap I already tried the `u` modifier without success, and I already tried without the `trim`. I'm pretty sure it isn't a display issue, because I tried comparing the strings (`$matches[1] == 'Paternò'` is false) – Marco Marsala Sep 29 '15 at 14:56
  • Did you try a lowercase /u? I think an uppercase /U means something different (http://php.net/manual/en/reference.pcre.pattern.modifiers.php). – Blackbam Sep 29 '15 at 15:08
  • Yes, I tried /U, /u and /uU. The uppercase U is correct because I also need non-greedy matching. @Blackbam – Marco Marsala Sep 29 '15 at 15:10
  • Try adding `(*UTF8)` at the start of your regex: `preg_match("/(*UTF8)^sito in (.*) \(([A-Z]{2})\)(.*)( CAP )?([0-9]{5})?$/U", "sito in Paternò (CT) Contrada Palazzolo, 28 CAP 95047", $matches);` See http://stackoverflow.com/a/9473867/1741150 – tchap Sep 29 '15 at 15:25
  • @tchap already tried this too without success – Marco Marsala Sep 29 '15 at 16:05
  • Then the problem is somewhere else: I cannot reproduce it with your code on a standard PHP installation (PHP 5.3). I get the expected result. – tchap Sep 30 '15 at 10:55
  • @MarcoMarsala I don't think it's a problem with the code. If you're printing to a page, what's the encoding?... If it's the command line: http://stackoverflow.com/questions/3410424/command-line-character-encoding-from-phps-exec ... If it's Windows cmd: http://stackoverflow.com/questions/1650369/php-utf-8-to-windows-command-line-encoding and check [iconv_set_encoding](http://php.net/manual/en/function.iconv-set-encoding.php) – Mariano Oct 01 '15 at 07:54
  • I'm on Windows command line. Already using the chcp 65001 trick. Tried iconv_set_encoding but no luck! – Marco Marsala Oct 05 '15 at 13:41
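
The comments above go back and forth between the regex and the output encoding. One way to tell the two apart is to look at the raw bytes of the capture instead of the console rendering. Note that `var_dump` reports string length in bytes, and `ò` takes two bytes in UTF-8 (C3 B2), so `string(8)` is the length to expect for `Paternò` in a UTF-8 script; `string(7)` would correspond to a single-byte encoding such as ISO-8859-1. The following is only a diagnostic sketch, not a confirmed fix for this question. It assumes the mbstring extension is available, and the variable names `$subject`, `$pattern` and `$town` are made up for illustration. The accented letter is written as explicit bytes so the result does not depend on how the script file is saved.

```php
<?php
// Subject string with "ò" spelled out as its UTF-8 bytes (C3 B2),
// so the test is independent of the script file's own encoding.
$subject = "sito in Patern\xC3\xB2 (CT) Contrada Palazzolo, 28 CAP 95047";

// Same pattern as in the question, with the lowercase `u` (UTF-8) modifier
// added next to the uppercase `U` (ungreedy) modifier.
$pattern = "/^sito in (.*) \(([A-Z]{2})\)(.*)( CAP )?([0-9]{5})?$/uU";

if (preg_match($pattern, $subject, $matches) !== 1) {
    // preg_last_error() reports e.g. invalid UTF-8 in the subject.
    exit("No match, PCRE error code: " . preg_last_error() . "\n");
}

$town = trim($matches[1]);

// bin2hex prints the exact bytes, regardless of the console code page.
echo bin2hex($town), "\n";             // 50617465726ec3b2
echo strlen($town), "\n";              // 8 (bytes)
echo mb_strlen($town, 'UTF-8'), "\n";  // 7 (characters); assumes mbstring is loaded
var_dump($town === "Patern\xC3\xB2");  // bool(true)
```

If the hex dump ends in `c3b2` with nothing after it, the capture itself is clean, and any extra character on screen would be coming from the output side. That is an inference from the byte dump, not something confirmed in this thread.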

0 Answers