I've come across the following materials:

  1. Mastering Perl by brian d foy, chapter: Debugging Regular Expressions.
  2. Debugging regular expressions, which mentions the re::debug module for Perl.

I've also tried various other techniques:

  1. The re=debugcolor module, which highlights its output.

  2. The construct (?{ print "$1 $2\n" }).

But I still don't understand how to read their output. I've also found other modules for debugging regular expressions here, but I haven't tried them yet. Can you please explain how to read the output of use re 'debug' or another command used for debugging regular expressions in Perl?

EDIT in reply to Borodin:

1st example:

perl -Mre=debug -e' "foobar"=~/(.)\1/'
Compiling REx "(.)\1"
Final program:
   1: OPEN1 (3)
   3:   REG_ANY (4)
   4: CLOSE1 (6)
   6: REF1 (8)
   8: END (0)
minlen 1
Matching REx "(.)\1" against "foobar"
   0 <> <foobar>             |  1:OPEN1(3)
   0 <> <foobar>             |  3:REG_ANY(4)
   1 <f> <oobar>             |  4:CLOSE1(6)
   1 <f> <oobar>             |  6:REF1(8)
                                  failed...
   1 <f> <oobar>             |  1:OPEN1(3)
   1 <f> <oobar>             |  3:REG_ANY(4)
   2 <fo> <obar>             |  4:CLOSE1(6)
   2 <fo> <obar>             |  6:REF1(8)
   3 <foo> <bar>             |  8:END(0)
Match successful!
Freeing REx: "(.)\1"
  1. What do OPEN1, REG_ANY, CLOSE1 ... mean?
  2. What do the numbers like 1, 3, 4, 6, 8 mean?
  3. What does the number in parentheses, as in OPEN1(3), mean?
  4. Which output should I look at, Compiling REx or Matching REx?

2nd example:

 perl -Mre=debugcolor -e' "foobar"=~/(.*)\1/'
Compiling REx "(.*)\1"
Final program:
   1: OPEN1 (3)
   3:   STAR (5)
   4:     REG_ANY (0)
   5: CLOSE1 (7)
   7: REF1 (9)
   9: END (0)
minlen 0
Matching REx "(.*)\1" against "foobar"
   0 <foobar>|  1:OPEN1(3)
   0 <foobar>|  3:STAR(5)
                                  REG_ANY can match 6 times out of 2147483647...
   6 <foobar>|  5:  CLOSE1(7)
   6 <foobar>|  7:  REF1(9)
                                    failed...
   5 <foobar>|  5:  CLOSE1(7)
   5 <foobar>|  7:  REF1(9)
                                    failed...
   4 <foobar>|  5:  CLOSE1(7)
   4 <foobar>|  7:  REF1(9)
                                    failed...
   3 <foobar>|  5:  CLOSE1(7)
   3 <foobar>|  7:  REF1(9)
                                    failed...
   2 <foobar>|  5:  CLOSE1(7)
   2 <foobar>|  7:  REF1(9)
                                    failed...
   1 <foobar>|  5:  CLOSE1(7)
   1 <foobar>|  7:  REF1(9)
                                    failed...
   0 <foobar>|  5:  CLOSE1(7)
   0 <foobar>|  7:  REF1(9)
   0 <foobar>|  9:  END(0)
Match successful!
Freeing REx: "(.*)\1"
  1. Why are the numbers descending 6 5 4 3 ... in this example?
  2. What does the failed keyword mean?
Wakan Tanka
    Asking how to use a debugger is very broad. Can you show us the pattern that you are trying to debug, and explain what you don't understand? – Borodin Mar 20 '15 at 14:45
    When you run `perl -Mre=debug`, you're using the `re` module; you can see the documentation by running `perldoc re`. The section on "debug mode" is a bit sparse, but ends with "See 'Debugging regular expressions' in perldebug for additional info." `perldoc perldebug` is similarly short on details, but ends with "These matters are explored in some detail in 'Debugging regular expressions' in perldebguts." And *now* we have [your answer](http://perldoc.perl.org/perldebguts.html#Debugging-Regular-Expressions). – ThisSuitIsBlackNot Mar 20 '15 at 15:38

2 Answers

Regular expressions define finite state machines [1]. The debugger is more or less showing you how the state machine is progressing as the string is consumed character by character.

"Compiling REx" introduces the listing of instructions for that regular expression. The number in parentheses after each instruction is where to go once the step succeeds. In /(.*)\1/:

1: OPEN1 (3)
3:   STAR (5)
4:     REG_ANY (0)
5: CLOSE1 (7)

STAR (5) means compute STAR and once you succeed, go to instruction 5 CLOSE1.

"Matching REx" is the step-by-step execution of those instructions. The number on the left is the current position in the string, that is, how many characters have been consumed so far. This number can go down when the matcher has to backtrack because something it tried didn't work.
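To look at the two phases separately, one option is to capture the trace and split it. This is only a rough sketch: it assumes a POSIX-style shell (for the single quotes and `2>&1`), relies on the trace going to STDERR, and the variable names `$trace`, `$compiled` and `$executed` are made up for illustration. The exact trace text varies between Perl versions.

```perl
use strict;
use warnings;

# Run a child perl with -Mre=debug; the trace is written to STDERR,
# so merge it into STDOUT to capture it with qx{}.
my $trace = qx{$^X -Mre=debug -e '"foobar" =~ /(.)\\1/' 2>&1};

# Split at the first line starting with "Matching REx":
# the part before is the compiled program, the part after is the execution trace.
my ($compiled, $executed) = split /^(?=Matching REx)/m, $trace, 2;
$compiled = '' unless defined $compiled;
$executed = '' unless defined $executed;

print "--- compiled program ---\n$compiled";
print "--- execution trace ---\n$executed";
```

Read the compiled program first to learn the node numbers, then follow the execution trace to see in which order those nodes are visited.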

To understand these instructions, it's important to understand how regular expressions "work." Finite state machines are usually visualized as a kind of flow chart. I have produced a crude one below for /(.)\1/. Because of the back reference to a capture group, I don't believe this regex is a strict finite state machine. The chart is useful nonetheless.

               Match                           
+-------+     Anything     +----------+        
| Start +------------------+  State 1 |        
+---^---+                  +--+---+---+        
    |                         |   |            
    |                         |   |Matched same
    +-------------------------+   | character  
            matched different     |            
                character    +----+------+     
                             |  Success  |     
                             +-----------+   

We start on Start. Advancing to the first state is easy: we just consume any one character (REG_ANY). The only other thing that could happen is end of input, which I haven't drawn here. The REG_ANY instruction is wrapped in the capture group instructions. OPEN1 starts recording all matched characters into the first capture group. CLOSE1 stops recording characters to the first capture group.

Once we consume a character, we sit on State 1 and consume the next char. If it matches the previous char we move to success! REF1 is the instruction that attempts to match capture group #1. Otherwise, we failed and need to move back to the Start to try again. Whenever the matcher says "failed..." it's telling you that something didn't work, so it's returning to an earlier state (that may or may not include 'unconsuming' characters).
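The trace for /(.)\1/ can be cross-checked without the debugger. A minimal sketch using only core Perl: `@-` and `@+` hold the start and end offsets of the last successful match, and they should line up with the position column in the trace (the successful attempt starts at 1 and END is reached at 3). The variable names are mine.

```perl
use strict;
use warnings;

my $text = "foobar";
my ($start, $end, $captured);

if ($text =~ /(.)\1/) {
    $start    = $-[0];   # offset where the overall match begins
    $end      = $+[0];   # offset just past the overall match
    $captured = $1;      # what was recorded between OPEN1 and CLOSE1
    print "matched from $start to $end, \$1 = '$captured'\n";
}
```

This prints `matched from 1 to 3, $1 = 'o'`: the attempt at offset 0 ("f" followed by "o") is the "failed..." line, and the attempt at offset 1 ("o" followed by "o") is the one that reaches END.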

The example with * is more complicated. * (which corresponds to STAR) tries to match the given pattern zero or more times, and it is greedy. That means it tries to match as many characters as it possibly can. Starting at the beginning of the string, it says "I can match up to 6 characters!" So, it matches all 6 characters ("foobar"), closes the capture group, and tries to match "foobar" again. That doesn't work! It tries again with 5, and that doesn't work either. And so on, until it tries matching zero characters. That means the capture group is empty, and matching the empty string always succeeds. So the match succeeds with \1 = "".
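The descending 6 5 4 ... 0 sequence can be imitated by hand. The loop below is a simplified stand-in for STAR followed by REF1 at start position 0, not Perl's actual engine, and the names `@tried`, `$capture` and `$real` are made up for illustration.

```perl
use strict;
use warnings;

my $text = "foobar";
my @tried;     # capture lengths attempted, in order
my $capture;   # what \1 ends up holding in the imitation

# Greedy: start with the longest possible capture and give back one character at a time.
for (my $len = length $text; $len >= 0; $len--) {
    push @tried, $len;
    my $group = substr($text, 0, $len);   # what (.*) would capture
    my $rest  = substr($text, $len);      # what is left for \1 to match
    if (index($rest, $group) == 0) {      # does $rest start with $group?
        $capture = $group;
        last;
    }
    # otherwise this corresponds to a "failed..." line in the trace
}

# Compare with the real regex engine.
my $real = $text =~ /(.*)\1/ ? $1 : undef;

print "tried lengths: @tried\n";
print "imitation captured '$capture', real engine captured '$real'\n";
```

Every length from 6 down to 1 fails because the rest of the string does not start with the captured text; only the empty capture succeeds, which is why the trace ends at position 0.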

I realize I've spent more time explaining regular expressions than I have Perl's regex debugger. But I think its output will become much more clear once you understand how regexes operate.

Here is a finite state machine simulator. You can enter a regex and see it executed. Unfortunately, it doesn't support back references.

1: I believe some of Perl's regular expression features push it beyond this definition but it's still useful to think about them this way.

axblount

The debug information contains a description of the bytecode. Numbers denote the node indices in the op tree. Numbers in round brackets tell the engine to jump to a specific node upon match. The EXACT operator tells the regex engine to look for a literal string. REG_ANY means the . symbol. PLUS means the +. Code 0 is for the 'end' node. OPEN1 is a '(' symbol. CLOSE1 means ')'. STAR is a '*'. When the matcher reaches the end node, it returns a success code back to Perl, indicating that the entire regex has matched.

See more details at http://perldoc.perl.org/perldebguts.html#Debugging-Regular-Expressions and a more conceptual http://perl.plover.com/Rx/paper/

Wiktor Stribiżew