
I'd like to change the regex word boundary \b so that other characters count as word characters (for example, so that a . would not create a boundary). I understand that \b is the boundary between \w and \W characters.

  local $_ = ".test";  # lexical "my $_" was removed in Perl 5.24
  if ( /(\btest\b)/ ){
    print;
    print " $1\n";
  }
  if ( /((?:(?<=\W)|^)test(?:(?=\W)|$))/ ){
    print;
    print " $1\n";
  }

This is what I came up with, and all I'd have to do is change \W to something like [^\w.], but I still want to know how Perl interprets \b in a regular expression. I tried deparsing it like this:

use B::Deparse;

my $deparser = B::Deparse->new("-sC", "-x10");

print $deparser->coderef2text( sub { 
          local $_ = ".test";  # lexical "my $_" was removed in Perl 5.24
          if ( /(\btest\b)/ ){
            print;
            print " $1\n";
          }
          if ( /((?:(?<=\W)|^)test(?:(?=\W)|$))/ ){
            print;
            print " $1\n";
          }
       });

I was hoping it would expand \b into what it was equivalent to. What is \b equivalent to? Can you deparse \b or other expressions further somehow?

hmatt1

2 Answers


\b is functionally equivalent to (?<!\w)(?=\w)|(?<=\w)(?!\w).

\B is functionally equivalent to (?<!\w)(?!\w)|(?<=\w)(?=\w).
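
As a quick sanity check of the two equivalences, here is a minimal sketch that compares, offset by offset, where each built-in assertion and its lookaround spelling match. The helper names `positions` and `agree` and the sample strings are made up for this illustration; they are not part of Perl or any module.

```perl
use strict;
use warnings;

# Lookaround spellings of \b and \B, copied from the text above.
my $b_equiv  = qr/(?<!\w)(?=\w)|(?<=\w)(?!\w)/;
my $nb_equiv = qr/(?<!\w)(?!\w)|(?<=\w)(?=\w)/;

# Hypothetical helper: offsets (0 .. length) at which a zero-width
# pattern matches in $str. ".{$i}" skips exactly $i characters first.
sub positions {
    my ($re, $str) = @_;
    my @found;
    for my $i (0 .. length($str)) {
        push @found, $i if $str =~ /\A.{$i}$re/s;
    }
    return @found;
}

# Hypothetical helper: true if both patterns match at the same offsets.
sub agree {
    my ($re1, $re2, $str) = @_;
    return join(',', positions($re1, $str)) eq join(',', positions($re2, $str));
}

for my $str ('', '.test', 'foo bar', 'a.b_c-9') {
    printf "%-10s \\b: %s  \\B: %s\n", "'$str'",
        (agree(qr/\b/, $b_equiv,  $str) ? 'same' : 'DIFFERENT'),
        (agree(qr/\B/, $nb_equiv, $str) ? 'same' : 'DIFFERENT');
}
```

This only checks a handful of ASCII strings, so it is evidence rather than proof, but it is an easy way to test any candidate replacement for \b before using it.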


The goal of Deparse is to produce a readable representation of Perl's understanding of the code. For example, f() and g(); and g() if f(); compile identically, so Deparse will give the more readable option, g() if f();, for both.

$ perl -MO=Deparse -e'f() and g()'
g() if f();
-e syntax OK

This means that even if \b and (?<!\w)(?=\w)|(?<=\w)(?!\w) compiled to the same code, and even if Deparse understood compiled regexes, it would still give you \b. Deparse is not what you want.


Maybe you're thinking of Concise. It shows what really gets executed. Notice the use of and in the following even though the original Perl uses if:

$ perl -MO=Concise,-exec -e'g() if f()'
1  <0> enter
2  <;> nextstate(main 1 -e:1) v:{
3  <0> pushmark s
4  <#> gv[*f] s/EARLYCV
5  <1> entersub[t6] sKS/TARG
6  <|> and(other->7) vK/1
7      <0> pushmark s
8      <#> gv[*g] s/EARLYCV
9      <1> entersub[t3] vKS/TARG
a  <@> leave[1 ref] vKP/REFC
-e syntax OK

But like Deparse, Concise knows nothing of the regex program the regex engine created from the string. So this is still not what you want.


However, there is an equivalent of Concise for regex patterns: use re 'debug';.

$ perl -Mre=debug -E'qr/\b/'
Compiling REx "\b"
Final program:
   1: BOUNDU (2)
   2: END (0)
stclass BOUNDU minlen 0
Freeing REx: "\b"

Apparently, \b is implemented as its own operation (BOUNDU here) rather than as a combination of lookarounds. For comparison,

$ perl -Mre=debug -E'qr/(?<!\w)(?=\w)|(?<=\w)(?!\w)/'
Compiling REx "(?<!\w)(?=\w)|(?<=\w)(?!\w)"
Final program:
   1: BRANCH (12)
   2:   UNLESSM[-1] (7)
   4:     POSIXU[\w] (5)
   5:     SUCCEED (0)
   6:   TAIL (7)
   7:   IFMATCH[0] (23)
   9:     POSIXU[\w] (10)
  10:     SUCCEED (0)
  11:   TAIL (23)
  12: BRANCH (FAIL)
  13:   IFMATCH[-1] (18)
  15:     POSIXU[\w] (16)
  16:     SUCCEED (0)
  17:   TAIL (18)
  18:   UNLESSM[0] (23)
  20:     POSIXU[\w] (21)
  21:     SUCCEED (0)
  22:   TAIL (23)
  23: END (0)
minlen 0
Freeing REx: "(?<!\w)(?=\w)|(?<=\w)(?!\w)"
ikegami
  • +1. For further information, the non-word boundary `\B` is functionally equivalent to the logical negation of `\b`, that is, the logical negation of `(?<!\w)(?=\w)|(?<=\w)(?!\w)` which -by applying De Morgan's law and simplifying- is equal to `(?<=\w)(?=\w)|(?<!\w)(?!\w)`. – mateleco Aug 06 '22 at 04:18

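\b is the zero-width boundary between “a word character” (\w) and “not a word character”. The latter is subtly different from “a non-word character” (\W): \W has to match an actual character, whereas \b also matches at the start or end of the string when the adjacent character is a word character. So it is equivalent to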

(?<!\w)(?=\w)|(?<=\w)(?!\w)
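
To get the customised boundary the question asks about, the same shape can be reused with a different character class in place of \w. The following is a sketch only: `boundary_for` is a hypothetical helper name, and it assumes the class you pass in matches exactly one character (so the lookbehinds stay fixed-width).

```perl
use strict;
use warnings;

# Hypothetical helper: build a \b-like assertion for a one-character class.
sub boundary_for {
    my ($class) = @_;    # e.g. '[\w.]'
    return qr/(?<!$class)(?=$class)|(?<=$class)(?!$class)/;
}

# Treat "." as a word character, so it no longer creates a boundary.
my $wb = boundary_for('[\w.]');

for my $str ('.test', 'a test.', 'a test b', 'test') {
    my $plain  = $str =~ /\btest\b/     ? 'match' : 'no match';
    my $custom = $str =~ /${wb}test$wb/ ? 'match' : 'no match';
    printf "%-10s plain \\b: %-8s  custom: %s\n", "'$str'", $plain, $custom;
}
```

Because a qr// object is wrapped in its own group when interpolated, the alternation inside `$wb` does not leak into the surrounding pattern.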

Deparsing will not work here because regexes are a separate language directly embedded into the Perl code. You can use re 'debug' to see how regexes are compiled and matched (see the re docs for more info on how to use this).

amon
  • Even if Deparse were regex-aware, it wouldn't do anything anyway, since `\b` has its own regex opcode: compare `perl -Mre=debug -E'qr/\b/'` with `perl -Mre=debug -E'qr/(?<!\w)(?=\w)|(?<=\w)(?!\w)/'` – ikegami Apr 08 '14 at 19:20