How to emulate word boundary when using unicode character properties?

Question

From my previous questions "Why under locale-pragma word characters do not match?" and "How to change nested quotes" I learnt that when dealing with UTF-8 data you can't trust \w to match word characters and must use the Unicode character property \p{Word} instead. Now I have found that the zero-width word boundary \b also does not work with UTF-8 data (with locale enabled), and I could not find an equivalent among the Unicode character properties. I thought I could construct it myself, like (?<=\P{Word})(\p{Word}+)(?=\P{Word}), which should be equivalent to \b(\w+)\b.

In the test script below I have two arrays to test two different regexes. The first, based on \b, works fine when locale is not enabled. To get it to work with locales too, I wrote another version that emulates the boundary with (?=\P{Word}), but it does not work as I expected (the expected results are shown in the script too).

Do you see what is wrong, and how can I get the emulated regex to work the way the first one does with ASCII (or without locale)?

#!/usr/bin/perl

use 5.010;
use utf8::all;
use locale; # et_EE.UTF-8 in my case
$| = 1;

my @test_boundary = (  # EXPECTED RESULT:
  '"abc def"',         # '«abc def»'
  '"abc "d e f" ghi"', # '«abc «d e f» ghi»'
  '"abc "d e f""',     # '«abc «d e f»»'
  '"abc "d e f"',      # '«abc "d e f»'
  '"abc "d" "e" f"',   # '«abc «d» «e» f»'
  # below won't work with \b when locale enabled
  '"100 Естонiï"',     # '«100 Естонiï»'
  '"äöõ "ä õ ü" ï"',   # '«äöõ «ä õ ü» ï»'
  '"äöõ "ä õ ü""',     # '«äöõ «ä õ ü»»'
  '"äöõ "ä õ ü"',      # '«äöõ "ä õ ü»'
  '"äöõ "ä" "õ" ï"',   # '«äöõ «ä» «õ» ï»'
);

my @test_emulate = (   # EXPECTED RESULT:
  '"100 Естонiï"',     # '«100 Естонiï»'
  '"äöõ "ä õ ü" ï"',   # '«äöõ «ä õ ü» ï»'
  '"äöõ "ä õ ü""',     # '«äöõ «ä õ ü»»'
  '"äöõ "ä õ ü"',      # '«äöõ "ä õ ü»'
  '"äöõ "ä" "õ" ï"',   # '«äöõ «ä» «õ» ï»'
);

say "BOUNDARY";
for my $sentence ( @test_boundary ) {
  my $quote_count = ( $sentence =~ tr/"/"/ );

  for ( my $i = 0 ; $i <= $quote_count ; $i += 2 ) {
    $sentence =~ s/
      "(                          # first quote, start capture
        [\p{Word}\.]+?            # at least one word-char or point
        .*?\b[\.,?!»]*?           # any chars followed by boundary + opt. punctuation
      )"                          # stop capture, ending quote
      /«$1»/xg;                   # change to fancy
  }
  say $sentence;
}

say "EMULATE";
for my $sentence ( @test_emulate ) {
  my $quote_count =  ( $sentence =~ tr/"/"/ );

  for ( my $i = 0 ; $i <= $quote_count ; $i += 2 ) {
    $sentence =~ s/
      "(                         # first quote, start capture
      [\p{Word}\.]+?             # at least one word-char or point
      .*?(?=\P{Word})            # any chars followed by emulated boundary
      [\.,?!»]*?                 # optional punctuation
      )"                         # stop capture, ending quote
      /«$1»/gx;                  # change to fancy
  }
  say $sentence;
}

First, you are mistaken: `\w` and `\p{Word}` are by definition identical. But second, ***please, please, please*** do not use the `use locale` pragma. It is broken, unreliable, unpredictable, and a royal pain in the butt, as you seem to have discovered. You should be using the `Unicode::Collate::Locale` module. You should probably not be using `use utf8::all` either, but rather doing the specific things that you want. – tchrist, Feb 23 '13 at 01:40
@tchrist: `\w` and `\p{Word}` may be defined identically, but they behave differently under `use locale`. Of course, I will not use locale now that I have another way. `use utf8::all` satisfies my needs pretty well and it is a clean way to show my intentions. If something is lacking in `utf8::all`, maybe you could point it out to the author? – w.k, Feb 23 '13 at 12:05
You don’t know what `utf8::all` does or does not do, which is precisely the problem: a problem, I note, that cannot be fixed by adding things to it. What level of `utf8` warnings do you get? None or warning or fatal? What about the three subtypes, the nonchar and the surrogates and the non_unicode? These things should be explicit in the code so people can see what they are. Then there is the issue of rendering into NFD on input and NFC on output; does it do that? – tchrist, Feb 23 '13 at 14:59
@tchrist: I understand `utf8::all` internals much better than `DBI`, `Unicode::Collate::Locale` or Perl core internals, but that does not stop me from using them within my limited scope as well as I can. I have never worried about normalization, for example. When I do one day, I'll dig deeper. That is a normal level of abstraction for me: not worrying about underlying details I don't fully understand as long as they work as expected. By the time `utf8::all` came out I had already put most of the picture together myself. But by using it I say: here is a "wheel" that everyone can use. I miss it being in core. – w.k, Feb 24 '13 at 10:08

score 7 · nhahtdh · Accepted Answer · 2013-02-19T13:01:22.587

In your regex, the character right after the position of the \b is always either some punctuation or " (to be safe, please double-check that \p{Word} matches neither), so it falls into the case \b\W: the boundary only has to assert that the preceding character is a word character. Therefore, we can emulate \b here with:

(?<=\p{Word})
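
A minimal sketch of that lookbehind dropped into the loop from the question, assuming plain `use utf8` with no `use locale`. The `fancy_quotes` wrapper is a hypothetical name introduced here for illustration; everything else follows the question's second regex, with only the boundary changed:

```perl
use strict;
use warnings;
use utf8;                              # the literals below are UTF-8
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical wrapper around the loop from the question.
# The only change is \b -> (?<=\p{Word}); "use locale" is not enabled.
sub fancy_quotes {
    my ($sentence) = @_;
    my $quote_count = ( $sentence =~ tr/"// );    # empty replacement list: count only

    for ( my $i = 0 ; $i <= $quote_count ; $i += 2 ) {
        $sentence =~ s/
            "(                      # opening quote, start capture
                [\p{Word}.]+?       # at least one word-char or point
                .*?(?<=\p{Word})    # anything, ending right after a word-char
                [.,?!»]*?           # optional punctuation
            )"                      # stop capture, closing quote
        /«$1»/xg;
    }
    return $sentence;
}

print fancy_quotes($_), "\n"
    for '"100 Естонiï"', '"äöõ "ä õ ü" ï"', '"äöõ "ä õ ü"';
```

With the three inputs above this should print `«100 Естонiï»`, `«äöõ «ä õ ü» ï»` and `«äöõ "ä õ ü»`, matching the expected results listed in the question.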

I am not familiar with Perl, but from what I tested here, it seems that \w (and \b) also work nicely when the encoding is set to UTF-8.

$sentence =~ s/
  "(
    [\w\.]+?
    .*?\b[\.,?!»]*?
  )"
  /«$1»/xg;

If you move up to Perl 5.14 or above, you can set the character set to Unicode with the /u flag.

You can use this general strategy to construct a boundary corresponding to a character class (just as the definition of the \b word boundary is based on the definition of \w).

Let C be the character class. We would like to define a boundary that is based on the character class C.

The construction below emulates a boundary in front when you know that the current character belongs to character class C (equivalent to \b\w):

(?<!C)C

Or behind (equivalent to \w\b):

C(?!C)

Why negative look-around? Because a positive look-around (with the complementary character class) also asserts that a character must exist ahead/behind (it needs at least one character there), so it fails at the beginning/end of the string. A negative look-around handles the beginning/end of the string without a cumbersome regex.
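
As a sketch of the strategy with a class other than \w, assume C is "word characters plus hyphen" (an example class chosen here, not something from the question), so that hyphenated words count as single tokens:

```perl
use v5.14;                             # also enables unicode_strings and say
use warnings;
use utf8;
binmode STDOUT, ':encoding(UTF-8)';

# Assumed example class C: word characters plus the hyphen.
my $C = qr/[\p{Word}\-]/;

my $text = 'well-known õun-ja-pirn fact';

# The built-in \b is tied to \w, so it splits at every hyphen.
my @by_word_boundary = $text =~ /\b\w+\b/g;

# (?<!C)C+(?!C): boundaries in front and behind, defined by C instead of \w.
my @by_class_boundary = $text =~ /(?<!$C)$C+(?!$C)/g;

say join '|', @by_word_boundary;
say join '|', @by_class_boundary;
```

The first line should come out as `well|known|õun|ja|pirn|fact` and the second as `well-known|õun-ja-pirn|fact`; the first and last tokens are found even though they touch the ends of the string.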

For \B\w emulation:

(?<=C)C

and similarly \w\B:

C(?=C)

\B is the direct opposite of \b, so we can just flip the positive/negative look-arounds to emulate the effect. It also makes sense: a non-boundary next to a word character can only be formed when there are more characters ahead/behind.
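
For instance, the \B\w emulation picks out every word character that is not the first one of its word. A small sketch with C = \p{Word}, using an invented sample string and lowercasing those characters:

```perl
use v5.14;                             # also enables unicode_strings and say
use warnings;
use utf8;
binmode STDOUT, ':encoding(UTF-8)';

my $title = 'ÄÖÕ ÜLE ÕUE';             # sample input, invented for this sketch

# (?<=C)C with C = \p{Word}: a word character preceded by a word character.
( my $emulated = $title ) =~ s/(?<=\p{Word})(\p{Word})/\L$1/g;

# The same substitution written with the built-in \B\w, for comparison.
( my $builtin = $title ) =~ s/\B(\w)/\L$1/g;

say $emulated;
say $builtin;
```

Both lines should read `Äöõ Üle Õue`, since the first letter of each word has no word character behind it and is left alone.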

Other emulations (let c be the complement character class of C):

\b\W: (?<=C)c
\W\b: c(?=C)
\B\W: (?<!C)c
\W\B: c(?!C)

For the emulation of a standalone boundary (equivalent to \b):

(?:(?<!C)(?=C)|(?<=C)(?!C))

And standalone non-boundary (equivalent to \B):

(?:(?<!C)(?!C)|(?<=C)(?=C))
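
One way to check the two standalone constructions is to compare the offsets at which they match with those of the built-in assertions. The following is only a sketch with C = \p{Word}; `match_offsets` is a helper name made up for this example:

```perl
use v5.14;                             # also enables unicode_strings and say
use warnings;
use utf8;

# Hypothetical helper: offsets at which a zero-width pattern matches.
sub match_offsets {
    my ( $re, $string ) = @_;
    my @offsets;
    push @offsets, pos($string) while $string =~ /$re/g;
    return "@offsets";
}

my $C            = qr/\p{Word}/;
my $boundary     = qr/(?:(?<!$C)(?=$C)|(?<=$C)(?!$C))/;
my $non_boundary = qr/(?:(?<!$C)(?!$C)|(?<=$C)(?=$C))/;

my $text = 'äöõ, ü!';

say 'emulated \b: ', match_offsets( $boundary,     $text );
say 'built-in \b: ', match_offsets( qr/\b/,        $text );
say 'emulated \B: ', match_offsets( $non_boundary, $text );
say 'built-in \B: ', match_offsets( qr/\B/,        $text );
```

For this string both \b variants should report `0 3 5 6` and both \B variants `1 2 4 7`, so every offset from 0 to 7 is covered by exactly one of the two.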

edited Feb 19 '13 at 13:01

answered Feb 18 '13 at 18:25

nhahtdh


Changing `\b` into `(?!\p{Word})` did not change the results. With the test case `'"äöõ "ä õ ü" ï"'`, instead of `äöõ "ä õ ü` I still get `äöõ ` captured, just as with my positive lookaround. Could you point out what goes wrong? – w.k Feb 18 '13 at 21:13
@w.k: I am not sure what you are trying to do (bracket matching?). The problem is not about word boundary (and its emulation), but with the regex that you are currently having. – nhahtdh Feb 18 '13 at 23:04
My goal is to change pairs of double quotes `"äöõ"` into fancy quotes `«äöõ»`. On nested quotes it should replace not matching pairs but the 1st and 3rd quote, then the 2nd and 4th. My first regex works exactly as I expected when I don't enable locale. But I need locale too. So the only change I made in the second regex is changing `\b` into `(?=\P{Word})` and, after your suggestion, into the negative lookahead `(?!\p{Word})`. Those lookaheads don't work as `\b` did, and I don't see why. – w.k Feb 19 '13 at 08:38
@w.k: I am not sure how you identify 2 quotes as a pair and change them into fancy quote. I am not sure you know what you are doing in your original regex: http://pastebin.com/DVUqzSYb – nhahtdh Feb 19 '13 at 12:14
Hmm, but the match on line 46 is exactly what I am looking for. I want to get the same match with the emulated boundary too, but I can't so far. `$1` becomes the string which I want to surround with fancy quotes. Where do you see a problem? In too many repeats? – w.k Feb 19 '13 at 12:27
@w.k: Oh, I see your problem when I run your code on ideone (but there is a limit to what I can do on ideone). Can you try `(?<=\p{Word})`? Since your case is equivalent to `\b\W`. – nhahtdh Feb 19 '13 at 12:33
Yes, it works. So, you checked that the previous char should be word char with this lookbehind. Could you please edit your answer to incorporate this information into it? – w.k Feb 19 '13 at 12:49
Thank you, accepted it. The problem with `\w` (and `\b`) arises only with `use locale` (which I need for sorting by my locale's standards). When not enforcing locale they work fine, because then `\w` is equal to `\p{Word}` (if Unicode rules are in effect). More details in the [perlrecharclass documentation](http://perldoc.perl.org/perlrecharclass.html#Word-characters) – w.k Feb 19 '13 at 13:34
@w.k ***Please, please, please*** do not use the `use locale` pragma. Use `Unicode::Collate::Locale`. – tchrist Feb 23 '13 at 01:38

score 5 · Answer 2 · answered Feb 18 '13 at 18:24

You should be using negative lookarounds:

(?<!\p{Word})(\p{Word}+)(?!\p{Word})

The positive lookarounds fail at the start or end of the string because they require a non-word character to be present. The negative lookarounds work in both cases.
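
The difference is easy to see on a string whose first and last words touch the ends. A minimal sketch, with a sample string made up for illustration:

```perl
use v5.14;                             # also enables unicode_strings and say
use warnings;
use utf8;
binmode STDOUT, ':encoding(UTF-8)';

my $text = 'äöõ ü ï';                  # words touch both ends of the string

# Positive lookarounds demand an actual non-word character on each side.
my @positive = $text =~ /(?<=\P{Word})(\p{Word}+)(?=\P{Word})/g;

# Negative lookarounds only forbid a word character on each side.
my @negative = $text =~ /(?<!\p{Word})(\p{Word}+)(?!\p{Word})/g;

say 'positive: ', join '|', @positive;
say 'negative: ', join '|', @negative;
```

The positive version should find only `ü`, the one word with a space on both sides, while the negative version finds `äöõ|ü|ï`.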

answered Feb 18 '13 at 18:24

Tim Pietzcker


Isn’t that just like writing `\b(\w+)\b`? – tchrist Feb 23 '13 at 01:39
He’s messing things up with the icky/broken `use locale`; see [this answer](http://stackoverflow.com/a/15036072/471272) for how to do locale stuff in Perl the right way. That way you can just use normal regex things, too. – tchrist Feb 23 '13 at 05:27
