
I am trying to extract a substring from a string using regular expressions. The following code works in Python and gives the desired result.

Python Solution

>>> import re
>>> x = r'CAR_2_ABC_547_d'
>>> spattern = re.compile("CAR_.*?_(.*)")
>>> spattern.search(x).group(1)
'ABC_547_d'

Perl Solution

$ echo "CAR_2_ABC_547_d" | perl -pe's/CAR_.*?_(.*)/$1/'
ABC_547_d

TCL Solution

However, when I use the same pattern in Tcl, I get a different result. Can someone please explain this behavior?

% regexp -inline "CAR_.*?_(.*)" "CAR_2_ABC_547_d"
CAR_2_ {}
sarbjit

2 Answers


The Tcl re_syntax manual states: "A branch has the same preference as the first quantified atom in it which has a preference."

So if you have .* as the first quantifier, the whole RE will be greedy, and if you have .*? as the first quantifier, the whole RE will be non-greedy.

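Since .*? is the first quantifier in your pattern, the rest of the expression, including the capturing (.*), is matched non-greedily as well, so the capture group matches the empty string.

If you anchor the pattern with $ (end of string), the match is forced to extend to the end, and the group captures the rest of the string.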

% regexp -inline "CAR_.*?_(.*)$" "CAR_2_ABC_547_d"
CAR_2_ABC_547_d ABC_547_d
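
The preference rule can be seen side by side in a short script. This is only a sketch: the variable name `s` is illustrative, the `.+` case comes from the comment thread below, and the negated character class `[^_]*` is an alternative not given in the original answer. It is offered as one common way to avoid a non-greedy first quantifier.

```tcl
set s "CAR_2_ABC_547_d"

# First quantifier is non-greedy, so the whole RE prefers the shortest match.
puts [regexp -inline {CAR_.*?_(.*)} $s]     ;# CAR_2_ {}

# With (.+) the group must take at least one character, and takes only one.
puts [regexp -inline {CAR_.*?_(.+)} $s]     ;# CAR_2_A A

# Anchoring with $ forces the match to reach the end of the string.
puts [regexp -inline {CAR_.*?_(.*)$} $s]    ;# CAR_2_ABC_547_d ABC_547_d

# Assumed alternative: a negated class keeps every quantifier greedy.
puts [regexp -inline {CAR_[^_]*_(.*)} $s]   ;# CAR_2_ABC_547_d ABC_547_d
```

The patterns are written in braces rather than double quotes so that `$` and the brackets reach the regex engine without any Tcl substitution.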

Reference: re_syntax

Dinesh
  • My requirement is to extract the substring "ABC_547_d". I deliberately used `?` to make it non-greedy. I just want to understand why it works in Python and not in Tcl. – sarbjit Sep 06 '16 at 11:09
  • This is with respect to `Tcl`. Since the first quantifier is non-greedy, the `.*` will match only the empty string. If you had used `.+`, it would give you the letter `A` alone. – Dinesh Sep 06 '16 at 11:13
  • Can you please suggest how to achieve the desired result using a Tcl regex with grouping? I can see the original regex working in Perl and Python. Does Tcl handle regexes differently? – sarbjit Sep 06 '16 at 12:40
  • @sarbjit Python and Tcl use different regex engines. (Perl's is also different, but Python's is derived from the Perl one by a complicated route.) – Donal Fellows Sep 06 '16 at 14:59
  • @Dinesh: That is complete nonsense. `.*?` can't match the empty string because the pattern requires that it is followed by an underscore. – Borodin Sep 06 '16 at 14:59
  • @Borodin, Dinesh is referring to the capturing `(.*)` which, since it is implicitly non-greedy, will match the empty string following the second underscore. – glenn jackman Sep 06 '16 at 17:41

Another approach, instead of capturing the text that follows the prefix, is to just remove the prefix:

% set result [regsub {^CAR_.*?_} "CAR_2_ABC_547_d" {}]
ABC_547_d
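
The prefix-removal idea could be wrapped in a small helper, sketched below. The proc name `stripCarPrefix` is hypothetical and not part of the original answer. One property worth noting is that `regsub` returns its input unchanged when the pattern does not match, so the caller gets the original string back rather than an empty capture.

```tcl
# Hypothetical helper: remove a leading "CAR_<field>_" prefix, if present.
proc stripCarPrefix {s} {
    # .*? is the only quantifier, so the match stops at the first
    # underscore that follows "CAR_".
    return [regsub {^CAR_.*?_} $s {}]
}

puts [stripCarPrefix "CAR_2_ABC_547_d"]   ;# ABC_547_d
puts [stripCarPrefix "no_prefix_here"]    ;# no_prefix_here
```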
glenn jackman