I'm currently stuck getting a regular expression in Perl (taken from an earlier question of mine) to match word characters from a non-ASCII locale (e.g., German umlauts).

I already tried various things such as setting the correct locale (using setlocale), converting data that I receive from MySQL to UTF8 (using decode_utf8), and so on... Unfortunately, to no avail. Google also did not help much.

Is there any way to make the following regex locale-aware so that

$street = "Täststraße"; # I know that this is not orthographically correct
$street =~ s{
               \b (\w{0,3}) (\w*) \b
            }
            {
               $1 . ( '*' x length $2 )
            }gex;

ends up returning $street = "Täs*******" instead of "Tästs***ße"?

Thilo-Alexander Ginkel

1 Answer

I would expect the regex to result in "Täs*******". And this is what I get when I "use utf8" in a UTF-8 encoded file with your code above.
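A minimal sketch of that setup, assuming the script itself is saved as UTF-8 and the output goes to a UTF-8 terminal. The helper name `mask_street` is hypothetical and only wraps the substitution from the question:

```perl
use strict;
use warnings;
use utf8;                               # string literals in this file are UTF-8
binmode STDOUT, ':encoding(UTF-8)';     # encode characters when printing

# Hypothetical helper: keep the first three word characters of each word
# and replace the remaining ones with asterisks.
sub mask_street {
    my ($street) = @_;
    $street =~ s{
        \b (\w{0,3}) (\w*) \b
    }
    {
        $1 . ( '*' x length $2 )
    }gex;
    return $street;
}

print mask_street("Täststraße"), "\n";  # prints "Täs*******"
```

Because of "use utf8", the literal is a 10-character string, so \w sees "ä" and "ß" as word characters instead of as pairs of non-word bytes.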

(If everything is latin-1, that changes the behavior of the regex engine. Hence the existence of utf8::upgrade. See Unicode::Semantics.)
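As a rough illustration of the Latin-1 case, the sketch below runs the same substitution on a Latin-1 string before and after utf8::upgrade. It assumes no locale pragma and no "unicode_strings" feature are in effect; `mask_street` is a hypothetical helper name:

```perl
use strict;
use warnings;
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper wrapping the substitution from the question.
sub mask_street {
    my ($street) = @_;
    $street =~ s{ \b (\w{0,3}) (\w*) \b }{ $1 . ( '*' x length $2 ) }gex;
    return $street;
}

# "Täststraße" as Latin-1: one byte per character, no internal UTF-8 flag.
my $latin1 = "T\xE4ststra\xDFe";

# Native (byte) semantics: \xE4 and \xDF are not treated as \w here.
print mask_street($latin1), "\n";    # prints "Tästs***ße"

# utf8::upgrade changes only the internal representation, which switches
# the regex engine to Unicode semantics for this string.
my $upgraded = $latin1;
utf8::upgrade($upgraded);
print mask_street($upgraded), "\n";  # prints "Täs*******"
```

The two strings compare equal with eq; only the regex behavior differs, which is the inconsistency that Unicode::Semantics is meant to paper over.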

Edit: I see you fixed your post and that we agree on the expected result. Basically, use Unicode::Semantics when you want Unicode semantics on your regexps.

jrockway
  • That's weird... When run in a standalone fashion the code indeed works. It turns out that "use locale" broke things... Once I removed that everything went back to normal. – Thilo-Alexander Ginkel Oct 12 '09 at 08:06
  • Yeah, "use locale" should be avoided. Use "use utf8" if you have UTF-8 literals in UTF-8-encoded source code. Otherwise, handle encoding with Encode, and use Unicode::Semantics when warranted. – jrockway Oct 12 '09 at 08:08
  • Is "use locale" a bad idea in all circumstances? Is it/should it be deprecated? – Ether Oct 12 '09 at 16:41
  • Depends on whether or not you want the behavior of your program to depend on the environment and random data in /usr/share/i18n/locales. If you need something to be locale dependent, why not just call the appropriate function directly? – jrockway Oct 13 '09 at 06:35
  • Good advice about `use utf8`; I put it in all my programs now. One should perhaps [warn about \b boundaries](http://stackoverflow.com/questions/4213800/is-there-something-like-a-counter-variable-in-regular-expression-replace/4214173#4214173) in patterns, though. They often surprise people! – tchrist Nov 18 '10 at 16:22
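For data that arrives as raw bytes (as with the MySQL rows mentioned in the question), the "handle encoding with Encode" advice from the comments might look like the sketch below. It assumes the input really is UTF-8 encoded; `mask_street_bytes` is a hypothetical helper name, and the byte string stands in for a value fetched from the database:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: takes UTF-8 encoded bytes, returns UTF-8 encoded bytes.
sub mask_street_bytes {
    my ($bytes) = @_;

    # Decode at the input boundary so the regex works on characters.
    my $street = decode('UTF-8', $bytes);

    $street =~ s{ \b (\w{0,3}) (\w*) \b }{ $1 . ( '*' x length $2 ) }gex;

    # Encode again at the output boundary.
    return encode('UTF-8', $street);
}

# "Täststraße" as UTF-8 bytes: "ä" is C3 A4, "ß" is C3 9F.
my $from_db = "T\xC3\xA4ststra\xC3\x9Fe";
print mask_street_bytes($from_db), "\n";   # prints "Täs*******" on a UTF-8 terminal
```

Note that neither "use locale" nor setlocale appears here; the result depends only on the declared encoding of the input.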