
I am investigating a regexp mystery. I am tired so I may be missing something obvious - but I can't see any reason for this.

In the examples below I use Perl, but I first saw this in Vim, so I am guessing it affects more than one regexp engine.

Assume we have this file:

$ cat data
1 =2   3 =4
5 =6  7 =8

We can then delete the whitespace in front of the '=' with...

$ cat data | perl -ne 's,(.)\s+=(.),\1=\2,g; print;'
1=2   3=4
5=6  7=8

Notice that in every line, all instances of the match are replaced; we used the /g modifier, which doesn't stop at the first replacement and instead keeps replacing until the end of the line.

For example, both the space before the '=2' and the space before the '=4' were removed, in the same line.

Why not use simpler constructs like 's, =,=,g'? Well, we were preparing for more difficult scenarios, where the right-hand sides of the assignments are quoted strings, either single- or double-quoted:

$ cat data2
1 ="2"   3 ='4 ='
5 ='6'  7 ="8"

To do the same work (remove the whitespace before the equal sign), we have to be careful, since the strings may contain the equal sign - so we mark the first quote we see, and look for it via back-references:

$ cat data2 | perl -ne 's,(.)\s+=(.)([^\2]*)\2,\1=\2\3\2,g; print;'
1="2"   3='4 ='
5='6'  7="8"

We used the back-reference \2 to search for anything that is not the same quote as the one we first saw, any number of times ([^\2]*). We then searched for the original quote itself (\2). If found, we used back references to refer to the matched parts in the replace target.

Now look at this:

$ cat data3 
posAndWidth ="40:5 ="   height        ="1"
posAndWidth ="-1:8 ='"  textAlignment ="Right"

What we want here is to drop the last space character before every instance of '=' in every line. Like before, we can't use a simple 's, =",=",g', because the strings themselves may contain the equal sign.

So we follow the same pattern as we did above, and use back-references:

$ cat data3 | perl -ne "s,(\w+)(\s*) =(['\"])([^\3]*)\3,\1\2=\3\4\3,g; print;"
posAndWidth="40:5 ="   height        ="1"
posAndWidth="-1:8 ='"  textAlignment ="Right"

It works... but only on the first match of the line! The space following 'textAlignment' was not removed, and neither was the one in the line above it (the one following 'height').

Basically, it seems that /g is not functional anymore: running the same replace command without /g produces exactly the same output:

$ cat data3 | perl -ne "s,(\w+)(\s*) =(['\"])([^\3]*)\3,\1\2=\3\4\3,; print;"
posAndWidth="40:5 ="   height        ="1"
posAndWidth="-1:8 ='"  textAlignment ="Right"

It appears that in this regexp, the /g is ignored. Any ideas why?

ttsiodras
  • Isn't it treating everything between the first quote and the last quote as a quoted string? – Nick Mar 08 '13 at 14:55
  • The [^\3]* part can't go on matching beyond the closing quote, can it? – ttsiodras Mar 08 '13 at 14:59
  • With your perl command I got a different result: in `posAndWidth="40:5="` the space between `5` and `=` was gone. – Kent Mar 08 '13 at 15:02
  • What happens when you remove the first entry from every line? Is the second one matched then? If it is, you have a problem with your anchorage. If it's not, it's a problem with the RegEx. On first glance, I can't see a flaw in them, too. – 0xCAFEBABE Mar 08 '13 at 15:17
  • Get [this table](https://gist.github.com/briandfoy/1342877) of what Perl backslash escapes mean in various contexts and versions. The short story is that a backslash before 1–3 digits within a character class is an octal number, so your `\3` is `\cC` or `\x03` or `\x{0003}`; in other words, it is a Control-C when used within a character class. – tchrist Mar 09 '13 at 14:11

2 Answers


Inserting some debug characters in your substitution sheds some light on the issue:

use strict;
use warnings;

while (<DATA>) {
    s,(\w+)(\s*) =(['"])([^\3]*)\3,$1$2=$3<$4>$3,g;
    print;                       #  here -^ -^
}

__DATA__
posAndWidth ="40:5 ="   height        ="1"
posAndWidth ="-1:8 ='"  textAlignment ="Right"

Output:

posAndWidth="<40:5 ="   height        ="1>"
posAndWidth="<-1:8 ='"  textAlignment ="Right>"
#            ^--------- match ---------------^

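Note that the match spans both quoted strings at once: it starts at the first opening quote and ends at the last closing quote on the line, so nothing is left for /g to match a second time. It would seem that [^\3]* does not do what you think it does.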
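
To see what [^\3]* really matches, here is a minimal sketch. It relies on the rule, noted in the comments on the question, that a backslash followed by a digit inside a bracketed character class is an octal escape, so [^\3] means "any character except chr(3)". The helper names class_excludes and show_group4 and the short sample line are made up for illustration.

```perl
use strict;
use warnings;

# Inside a bracketed character class, \3 is an octal escape for chr(3)
# (Control-C), not a reference to capture group 3.
sub class_excludes {
    my ($char) = @_;
    return $char =~ /^[^\3]$/ ? 0 : 1;   # true if [^\3] rejects $char
}

# Returns what group 4 captured when the pattern from the question is applied.
sub show_group4 {
    my ($line) = @_;
    return $line =~ /(\w+)(\s*) =(['"])([^\3]*)\3/ ? $4 : undef;
}

print "chr(3) rejected: ", (class_excludes("\x03") ? "yes" : "no"), "\n";  # yes
print "quote rejected:  ", (class_excludes('"')    ? "yes" : "no"), "\n";  # no

# Quotes are allowed inside group 4, so the greedy star runs to the end of
# the line and backtracks only as far as the LAST matching quote.
print show_group4(q{a ="x"  b ="y"}), "\n";   # x"  b ="y
```

This would also explain why the earlier data2 example appeared to work: on each of its lines, the last occurrence of the opening quote character happens to be the correct closing quote.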

Regex is not the best tool here. Use a parser that can handle quoted strings, such as Text::ParseWords:

use strict;
use warnings;
use Data::Dumper;
use Text::ParseWords;

while (<DATA>) {
    chomp;
    my @a = quotewords('\s+', 1, $_);
    print Dumper \@a;
    print "@a\n";
}

__DATA__
posAndWidth ="40:5 ="   height        ="1"
posAndWidth ="-1:8 ='"  textAlignment ="Right"

Output:

$VAR1 = [
          'posAndWidth',
          '="40:5 ="',
          'height',
          '="1"'
        ];
posAndWidth ="40:5 =" height ="1"
$VAR1 = [
          'posAndWidth',
          '="-1:8 =\'"',
          'textAlignment',
          '="Right"'
        ];
posAndWidth ="-1:8 ='" textAlignment ="Right"

I included the Dumper output so you can see how the strings are split.
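
As a rough sketch of how those tokens could be put to use, the following glues every token that starts with '=' onto the token before it. The helper name glue_assignments is hypothetical. Note the trade-off: the line is rebuilt with single spaces, so the original column alignment is lost, which may or may not be acceptable for the question's use case.

```perl
use strict;
use warnings;
use Text::ParseWords;

# Hypothetical helper: split on unquoted whitespace (keeping the quotes),
# then attach each token that begins with '=' to the token before it.
sub glue_assignments {
    my ($line) = @_;
    my @out;
    for my $tok (grep { defined($_) && length($_) } quotewords('\s+', 1, $line)) {
        if (@out && $tok =~ /^=/) {
            $out[-1] .= $tok;      # "key" . "=value"
        }
        else {
            push @out, $tok;
        }
    }
    return join ' ', @out;         # original spacing is NOT preserved
}

print glue_assignments(q{posAndWidth ="40:5 ="   height        ="1"}), "\n";
# posAndWidth="40:5 =" height="1"
```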

TLP
  • If [^\3]* is not doing what I think... then what exactly does it do? It should match any character except the quote that we began with, so it should stop at the first closing quote. Is this a bug in the regex engine? – ttsiodras Mar 08 '13 at 15:25
  • @ttsiodras Inside a character class, I doubt that meta characters work. In which case you are trying to negate `\3`, whatever that turns into. Have you tried `use re 'debug'`? – TLP Mar 08 '13 at 15:32
  • A discussion on negative back-references: http://www.perlmonks.org/?node_id=747135 It explains that `[^...]` does not work with a back-reference, as TLP suspected. – cooltea Mar 08 '13 at 15:36

I will elaborate on my comment to TLP's answer:

ttsiodras, you are asking two questions:

1- Why does your regex not produce the desired result? Why does the g flag not work?

The answer is that your regular expression contains [^\3], which is not handled the way you expect: inside a character class, \3 is not recognised as a back-reference. I looked for one but could not find a way to use a back-reference inside a character class.

2- how do you remove the space preceding an equal sign and leave alone the part that comes after and is between quotes?

This would be a way to do it (see this reference):

$ cat data3 | perl -pe "s,(([\"']).*?\2)| (=),\1\3,g"
posAndWidth="40:5 ="   height       ="1"
posAndWidth="-1:8 ='"  textAlignment="Right"

The first alternative of the regex captures whatever is between quotes (single or double) and replaces it with itself; the second alternative matches the equal sign preceded by a space, which is what you are looking for. Please note that this solution only works around the "interesting" behaviour of the negated character class with a back-reference, [^\3], by using the non-greedy quantifier *? instead.
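
The same idea can be sketched as a standalone script. The function name drop_space_before_equals and the sample line are illustrative only. Under `use warnings`, the unused capture group is undefined on each match, so this version picks whichever group took part instead of concatenating both.

```perl
use strict;
use warnings;

# Hypothetical helper implementing the alternation approach.
sub drop_space_before_equals {
    my ($line) = @_;
    # Alternative 1 consumes a whole quoted string and puts it back unchanged;
    # alternative 2 matches " =" outside quotes and keeps only the "=".
    $line =~ s/((["']).*?\2)| (=)/defined($1) ? $1 : $3/ge;
    return $line;
}

print drop_space_before_equals(q{a ="x ="  b ="y"}), "\n";   # a="x ="  b="y"
```

Because quoted strings are consumed before the " =" alternative gets a chance, an equal sign inside quotes is never touched. A stray unquoted apostrophe or quote in the input would confuse this pattern, so it assumes well-formed lines like the ones in the question.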


Finally, if you want to pursue the negative lookahead solution:

$ cat data3 | perl -pe 's,(\w+)(\s*) =(["'"'"'])((?:(?!\3).)*)\3,\1\2=\3\4\3,g'
posAndWidth="40:5 ="   height       ="1"
posAndWidth="-1:8 ='"  textAlignment="Right"

The part with the quotes between square brackets still means "[\"']", but I had to use single quotes around the whole perl command: inside double quotes, an interactive bash treats the '!' in the negative lookahead (?!...) as history expansion and reports an error.
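
Written as a script, the shell quoting problem disappears. The following is a small sketch of the lookahead approach; the function name drop_space_lookahead and the sample lines are assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: (?:(?!\3).)* consumes one character at a time, but
# only while the next character is not the quote captured in group 3.
# Outside a character class, \3 is a real back-reference.
sub drop_space_lookahead {
    my ($line) = @_;
    $line =~ s/(\w+)(\s*) =(["'])((?:(?!\3).)*)\3/$1$2=$3$4$3/g;
    return $line;
}

print drop_space_lookahead(q{a ="x ="  b ='y'}), "\n";      # a="x ="  b='y'
print drop_space_lookahead(q{k ="-1:8 ='"  t ="R"}), "\n";  # k="-1:8 ='"  t="R"
```

Since the lookahead stops at the first occurrence of the opening quote character, each match ends at its own closing quote, and /g can then find the next assignment on the line.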

EDIT Corrected the regex with negative lookahead: notice the non-greedy operator *? again and the g flag.

EDIT Took ttsiodras's comment into account: removed the non-greedy operator.

EDIT Took TLP's comment into account

cooltea
  • The second part of your answer (the negative back reference that I started with) is not working - it removes the space of the first equal sign only... – ttsiodras Mar 09 '13 at 07:22
  • That's true I need to look further into this. – cooltea Mar 09 '13 at 09:05
  • Ok I have corrected the second regex, thought it would take me longer. – cooltea Mar 09 '13 at 09:11
  • `perl -ne ' ... print;'` is the long version of `perl -pe ' ... '` – TLP Mar 09 '13 at 11:42
  • Perfect, thanks. Just to make sure, though: the link to the SO question you added uses ((?!\3).)* while you use (?!\3).* and I am not sure your form is correct... in theory your form could match a non-quote and then proceed to match whatever. – ttsiodras Mar 09 '13 at 12:24
  • Verified - no need for non-greedy stars: perl -pe 's,(\w+)(\s*) =\s*(["'"'"'])((?:(?!\3).)*)\3,\1\2=\3\4\3,g' – ttsiodras Mar 09 '13 at 12:36
  • You are correct that backrefs are not expanded in character classes. See [this table](http://blogs.perl.org/users/brian_d_foy/2011/11/perl-regex-escapes-by-version.html), downloadable [here](https://gist.github.com/briandfoy/1342877). The short story is that it is an octal number, so `\3` is Control-C in a character class. – tchrist Mar 09 '13 at 14:09