7

I need to replace character (say) x with character (say) P in a string, but only if it is contained in a quoted substring. An example makes it clearer:

axbx'cxdxe'fxgh'ixj'k  -> axbx'cPdPe'fxgh'iPj'k

Let's assume, for the sake of simplicity, that quotes always come in pairs.

The obvious way is to just process the string one character at a time (a simple state machine approach);
however, I'm wondering if regular expressions can be used to do all the processing in one go.

My target language is C#, but I guess my question pertains to any language having builtin or library support for regular expressions.

Torsten Marek
  • 83,780
  • 21
  • 91
  • 98
Cristian Diaconescu
  • 34,633
  • 32
  • 143
  • 233

9 Answers9

9

I converted Greg Hewgill's python code to C# and it worked!

[Test]
public void ReplaceTextInQuotes()
{
  Assert.AreEqual("axbx'cPdPe'fxgh'iPj'k", 
    Regex.Replace("axbx'cxdxe'fxgh'ixj'k",
      @"x(?=[^']*'([^']|'[^']*')*$)", "P"));
}

That test passed.

Greg Hewgill
  • 951,095
  • 183
  • 1,149
  • 1,285
jop
  • 82,837
  • 10
  • 55
  • 52
8

I was able to do this with Python:

>>> import re
>>> re.sub(r"x(?=[^']*'([^']|'[^']*')*$)", "P", "axbx'cxdxe'fxgh'ixj'k")
"axbx'cPdPe'fxgh'iPj'k"

What this does is use the non-capturing match (?=...) to check that the character x is within a quoted string. It looks for some nonquote characters up to the next quote, then looks for a sequence of either single characters or quoted groups of characters, until the end of the string.

This relies on your assumption that the quotes are always balanced. This is also not very efficient.

Greg Hewgill
  • 951,095
  • 183
  • 1,149
  • 1,285
  • Consider also that it's the re.sub() function that iterates over the string. the regular expression itself only matches the first x within quotes. – Remo.D Sep 26 '08 at 10:33
  • I can't imagine how you would even represent the solution to this problem without using something like re.sub(). After all, a regular expression by itself only matches, and the original question asked about replacing. – Greg Hewgill Sep 26 '08 at 10:36
  • you're right: it can't be done witough some "extra machinery", that's why many answered that "plain regular expressions" where not powerful enough – Remo.D Sep 26 '08 at 10:44
  • I stole your code and converted into C#. I hope you don't mind. Thanks. :) – jop Sep 26 '08 at 11:08
  • I guess that's what I was looking for - a way to iterate over regex matches. Thanks! – Cristian Diaconescu Sep 26 '08 at 12:12
3

A more general (and simpler) solution which allows non-paired quotes.

  1. Find quoted string
  2. Replace 'x' by 'P' in the string

    #!/usr/bin/env python
    import re
    
    text = "axbx'cxdxe'fxgh'ixj'k"
    
    s = re.sub("'.*?'", lambda m: re.sub("x", "P", m.group(0)), text)
    
    print s == "axbx'cPdPe'fxgh'iPj'k", s
    # ->   True axbx'cPdPe'fxgh'iPj'k
    
jfs
  • 399,953
  • 195
  • 994
  • 1,670
2

The trick is to use non-capturing group to match the part of the string following the match (character x) we are searching for. Trying to match the string up to x will only find either the first or the last occurence, depending whether non-greedy quantifiers are used. Here's Greg's idea transposed to Tcl, with comments.

set strIn {axbx'cxdxe'fxgh'ixj'k}
set regex {(?x)                     # enable expanded syntax 
                                    # - allows comments, ignores whitespace
            x                       # the actual match
            (?=                     # non-matching group
                [^']*'              # match to end of current quoted substring
                                    ##
                                    ## assuming quotes are in pairs,
                                    ## make sure we actually were 
                                    ## inside a quoted substring
                                    ## by making sure the rest of the string 
                                    ## is what we expect it to be
                                    ##
                (
                    [^']*           # match any non-quoted substring
                    |               # ...or...
                    '[^']*'         # any quoted substring, including the quotes
                )*                  # any number of times
                $                   # until we run out of string :)
            )                       # end of non-matching group
}

#the same regular expression without the comments
set regexCondensed {(?x)x(?=[^']*'([^']|'[^']*')*$)}

set replRegex {P}
set nMatches [regsub -all -- $regex $strIn $replRegex strOut]
puts "$nMatches replacements. "
if {$nMatches > 0} {
    puts "Original: |$strIn|"
    puts "Result:   |$strOut|"
}
exit

This prints:

3 replacements. 
Original: |axbx'cxdxe'fxgh'ixj'k|
Result:   |axbx'cPdPe'fxgh'iPj'k|
Cristian Diaconescu
  • 34,633
  • 32
  • 143
  • 233
2
#!/usr/bin/perl -w

use strict;

# Break up the string.
# The spliting uses quotes
# as the delimiter.
# Put every broken substring
# into the @fields array.

my @fields;
while (<>) {
    @fields = split /'/, $_;
}

# For every substring indexed with an odd
# number, search for x and replace it
# with P.

my $count;
my $end = $#fields;
for ($count=0; $count < $end; $count++) {
    if ($count % 2 == 1) {
        $fields[$count] =~ s/a/P/g;
    }    
}

Wouldn't this chunk do the job?

1

Not with plain regexp. Regular expressions have no "memory" so they cannot distinguish between being "inside" or "outside" quotes.

You need something more powerful, for example using gema it would be straighforward:

'<repl>'=$0
repl:x=P
Remo.D
  • 16,122
  • 6
  • 43
  • 74
1

Similar discussion about balanced text replaces: Can regular expressions be used to match nested patterns?

Although you can try this in Vim, but it works well only if the string is on one line, and there's only one pair of 's.

:%s:\('[^']*\)x\([^']*'\):\1P\2:gci

If there's one more pair or even an unbalanced ', then it could fail. That's way I included the c a.k.a. confirm flag on the ex command.

The same can be done with sed, without the interaction - or with awk so you can add some interaction.

One possible solution is to break the lines on pairs of 's then you can do with vim solution.

Community
  • 1
  • 1
Zsolt Botykai
  • 50,406
  • 14
  • 85
  • 110
  • I think this will only replace the first x that it found inside the quotes. The subsequent x's will be matched by the non-apostrophe pattern [^']* – jop Sep 26 '08 at 11:11
  • You are right. But it can be modified to replace all the x's. – Zsolt Botykai Sep 26 '08 at 11:19
1
Pattern:     (?s)\G((?:^[^']*'|(?<=.))(?:'[^']*'|[^'x]+)*+)x
Replacement: \1P
  1. \G — Anchor each match at the end of the previous one, or the start of the string.
  2. (?:^[^']*'|(?<=.)) — If it is at the beginning of the string, match up to the first quote.
  3. (?:'[^']*'|[^'x]+)*+ — Match any block of unquoted characters, or any (non-quote) characters up to an 'x'.

One sweep trough the source string, except for a single character look-behind.

Markus Jarderot
  • 86,735
  • 21
  • 136
  • 138
0

Sorry to break your hopes, but you need a push-down automata to do that. There is more info here: Pushdown Automaton

In short, Regular expressions, which are finite state machines can only read and has no memory while pushdown automaton has a stack and manipulating capabilities.

Edit: spelling...

Tobias Wärre
  • 793
  • 5
  • 11