Given the following regular expression, the goal of which is to capture the text preceding the items in the capturing group:

/cliente:[\sa-z.ñÑ0-9(),']+(?=((?:traslado|tr|giro|rut|rt)\:.*))/gmi

With the text string:

CLIENTE:NUBOX S.A.TRASLADO:CONSIGNACIONESRUT:25387 TR:CONSIG

This gives the expected result. But if I replace the character class with a dot, as follows:

/cliente:.+(?=((?:traslado|tr|giro|rut|rt)\:.*))/gmi

it breaks the capturing group, yielding:

**CLIENTE:NUBOX S.A.TRASLADO:CONSIGNACIONESRUT:25387 **TR:CONSIG

I need to know why this is happening.

Casimir et Hippolyte
  • Because the dot does not match newlines, while the \s in your character class does. – Jan Sep 23 '16 at 14:42
  • *First, I encourage you to read the [Greedy vs Non-greedy](http://stackoverflow.com/questions/2301285/what-do-lazy-and-greedy-mean-in-the-context-of-regular-expressions) question.* Replacing the old pattern with a greedy `.+` makes the regex engine consume every character of the input string when it reaches `.+`. After that, the cursor is at the end of the line. Every part of the pattern has to match, so the engine moves on to the next part, which is a positive lookahead. At the end of the string there is no `traslado`, `rt` or ... – revo Sep 23 '16 at 14:44
  • ... other choice from the alternation. So it backtracks one step, to right before `G`. The lookahead fails again. It keeps backtracking until it lands before `TR`. The lookahead succeeds and the engine is satisfied. Period. – revo Sep 23 '16 at 14:44
  • The interesting difference between `[\sa-z.ñÑ0-9(),']+` and `.+` is that `[\sa-z.ñÑ0-9(),']+` doesn't match the colon `:`, whereas `.+` does. – Casimir et Hippolyte Sep 23 '16 at 16:44

1 Answer

You can see the difference in how they match by looking at the regex debugger on Regex101.

Let me explain how each regex matches.

Let's look at the first regex. The first part of that regex, `cliente:[\sa-z.ñÑ0-9(),']+`, will initially match this:

**CLIENTE:NUBOX S.A.TRASLADO**:CONSIGNACIONESRUT:25387 TR:CONSIG

It stops at the :, but it has gone too far; it cannot match the lookahead. It must then backtrack character by character, seeing if it can match that lookahead:

**CLIENTE:NUBOX S.A.TRASLAD**O:CONSIGNACIONESRUT:25387 TR:CONSIG
**CLIENTE:NUBOX S.A.TRASLA**DO:CONSIGNACIONESRUT:25387 TR:CONSIG
**CLIENTE:NUBOX S.A.TRASL**ADO:CONSIGNACIONESRUT:25387 TR:CONSIG
...
**CLIENTE:NUBOX S.A.**TRASLADO:CONSIGNACIONESRUT:25387 TR:CONSIG
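
The same thing can be checked outside the debugger. This is a minimal sketch in Python, assuming its `re` module treats this pattern the same way as the engine used in the question (the `i` and `m` flags become `re.I | re.M`; the `g` flag has no counterpart and is not needed for a single match). The variable names are only illustrative.

```python
import re

text = "CLIENTE:NUBOX S.A.TRASLADO:CONSIGNACIONESRUT:25387 TR:CONSIG"

# First part only: the character class is greedy, but it cannot cross a colon.
class_only = re.match(r"cliente:[\sa-z.ñÑ0-9(),']+", text, re.I)
print(class_only.group(0))  # CLIENTE:NUBOX S.A.TRASLADO

# Full first pattern from the question: the engine backtracks from the colon
# until the lookahead finds one of the keywords followed by ":".
full = re.search(
    r"cliente:[\sa-z.ñÑ0-9(),']+(?=((?:traslado|tr|giro|rut|rt)\:.*))",
    text,
    re.I | re.M,
)
print(full.group(0))  # CLIENTE:NUBOX S.A.
print(full.group(1))  # TRASLADO:CONSIGNACIONESRUT:25387 TR:CONSIG
```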

Now, let's look at the second regex. The first part of that regex, `cliente:.+`, will match the entire line:

**CLIENTE:NUBOX S.A.TRASLADO:CONSIGNACIONESRUT:25387 TR:CONSIG**

Of course, this does not leave anything for the lookahead to match, so it must backtrack, character by character:

**CLIENTE:NUBOX S.A.TRASLADO:CONSIGNACIONESRUT:25387 TR:CONSI**G
**CLIENTE:NUBOX S.A.TRASLADO:CONSIGNACIONESRUT:25387 TR:CONS**IG
**CLIENTE:NUBOX S.A.TRASLADO:CONSIGNACIONESRUT:25387 TR:CON**SIG
...
**CLIENTE:NUBOX S.A.TRASLADO:CONSIGNACIONESRUT:25387 **TR:CONSIG
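
To make the starting spot explicit, here is a rough Python sketch (again assuming Python's `re` behaves like the question's engine for this pattern). It runs the second pattern as written, then imitates the backtracking by hand. `walk_back`, `keyword`, `class_start` and `dot_start` are hypothetical names made up for this illustration, not part of any library.

```python
import re

text = "CLIENTE:NUBOX S.A.TRASLADO:CONSIGNACIONESRUT:25387 TR:CONSIG"

# The second pattern from the question, unchanged.
dot_match = re.search(
    r"cliente:.+(?=((?:traslado|tr|giro|rut|rt)\:.*))", text, re.I | re.M
)
print(repr(dot_match.group(0)))  # 'CLIENTE:NUBOX S.A.TRASLADO:CONSIGNACIONESRUT:25387 '
print(dot_match.group(1))        # TR:CONSIG

# Hand-rolled imitation of the backtracking (illustration only).
keyword = re.compile(r"(?:traslado|tr|giro|rut|rt):", re.I)
prefix_len = len("CLIENTE:")

def walk_back(start_end):
    """Shrink the match one character at a time until the lookahead fits."""
    for end in range(start_end, prefix_len, -1):  # keep at least one character
        if keyword.match(text, end):
            return text[:end]
    return None

class_start = text.index(":", prefix_len)  # for this input, the class stops at the next colon
dot_start = len(text)                      # the dot runs to the end of the line

print(repr(walk_back(class_start)))  # 'CLIENTE:NUBOX S.A.'
print(repr(walk_back(dot_start)))    # 'CLIENTE:NUBOX S.A.TRASLADO:CONSIGNACIONESRUT:25387 '
```

The walk is identical in both cases; only the position it starts from changes.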

Because you backtrack from a different spot, you get a different result. This behavior, matching as much as possible, is greedy. On the other hand, you have the option to make things lazy:

/cliente:.+?(?=((?:traslado|tr|giro|rut|rt)\:.*))/gmi

The first part of this altered regex, `cliente:.+?`, will match as little as possible at first:

**CLIENTE:N**UBOX S.A.TRASLADO:CONSIGNACIONESRUT:25387 TR:CONSIG

It then tries to match the lookahead. At this point it can't match, so it inches up a character and tries to match that lookahead again, repeating until it finds something (and returns the match) or no string is left (and it fails):

**CLIENTE:NU**BOX S.A.TRASLADO:CONSIGNACIONESRUT:25387 TR:CONSIG
**CLIENTE:NUB**OX S.A.TRASLADO:CONSIGNACIONESRUT:25387 TR:CONSIG
**CLIENTE:NUBO**X S.A.TRASLADO:CONSIGNACIONESRUT:25387 TR:CONSIG
...
**CLIENTE:NUBOX S.A.**TRASLADO:CONSIGNACIONESRUT:25387 TR:CONSIG
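
As a quick sanity check, the lazy and greedy dot patterns can be compared side by side. This is a sketch in Python, assuming its `re` module matches this pattern the same way as the question's engine. `walk_forward` and `keyword` are hypothetical names that only imitate the lazy expansion by hand.

```python
import re

text = "CLIENTE:NUBOX S.A.TRASLADO:CONSIGNACIONESRUT:25387 TR:CONSIG"
lookahead = r"(?=((?:traslado|tr|giro|rut|rt)\:.*))"

lazy = re.search(r"cliente:.+?" + lookahead, text, re.I | re.M)
greedy = re.search(r"cliente:.+" + lookahead, text, re.I | re.M)

print(repr(lazy.group(0)))    # 'CLIENTE:NUBOX S.A.'
print(repr(greedy.group(0)))  # 'CLIENTE:NUBOX S.A.TRASLADO:CONSIGNACIONESRUT:25387 '

# Hand-rolled imitation of the lazy expansion (illustration only).
keyword = re.compile(r"(?:traslado|tr|giro|rut|rt):", re.I)
prefix_len = len("CLIENTE:")

def walk_forward():
    """Grow the match one character at a time until the lookahead fits."""
    for end in range(prefix_len + 1, len(text) + 1):  # .+? needs at least one character
        if keyword.match(text, end):
            return text[:end]
    return None  # ran out of string: the match fails

print(repr(walk_forward()))  # 'CLIENTE:NUBOX S.A.'
```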
Laurel
  • Why do you say it fails? With this pattern (making it lazy) I get what I want: /cliente:.+?(?=((?:traslado|tr|giro|rut|rt)\:.*))|cliente:.+?/gmi **cliente:nubox . s.a** – Fabian Olmos Sep 26 '16 at 14:37
  • @FabianOlmos I've edited the wording a bit. Does it make a bit more sense now? – Laurel Sep 26 '16 at 14:44