Regular Expression to split on specific character ONLY if that character is not in a pair

Question

After finding the fastest string replace algorithm in this thread, I've been trying to modify one of them to suit my needs, particularly this one by gnibbler.

I will explain the problem again here, and what issue I am having.

Say I have a string that looks like this:

str = "The &yquick &cbrown &bfox &Yjumps over the &ulazy dog"

You'll notice a lot of locations in the string where there is an ampersand, followed by a character (such as "&y" and "&c"). I need to replace these characters with an appropriate value that I have in a dictionary, like so:

dict = {"y":"\033[0;30m",
        "c":"\033[0;31m",
        "b":"\033[0;32m",
        "Y":"\033[0;33m",
        "u":"\033[0;34m"}

Using gnibbler's solution from my previous thread, this is my current solution:

myparts = str.split('&')
myparts[1:]=[dict.get(x[0],"&"+x[0])+x[1:] for x in myparts[1:]]
result = "".join(myparts)

This works for replacing the characters properly, and does not fail on characters that are not found. The only problem with this is that there is no simple way to actually keep an ampersand in the output. The easiest way I could think of would be to change my dictionary to contain:

dict = {"y":"\033[0;30m",
        "c":"\033[0;31m",
        "b":"\033[0;32m",
        "Y":"\033[0;33m",
        "u":"\033[0;34m",
        "&":"&"}

And change my "split" call to do a regex split on ampersands that are NOT preceded by another ampersand.

>>> import re
>>> tmp = "&yI &creally &blove A && W &uRootbeer."
>>> tmp.split('&')
['', 'yI ', 'creally ', 'blove A ', '', ' W ', 'uRootbeer.']
>>> re.split('MyRegex', tmp)
['', 'yI ', 'creally ', 'blove A ', '& W ', 'uRootbeer.']

Basically, I need a regex that splits on the first ampersand of a pair and on every lone ampersand, so that I can escape an ampersand via my dictionary.

If anyone has any better solutions please feel free to let me know.

Mike, I'm a little puzzled (though not personally hurt) why you don't use my solution from that other question. It turned out to be the fastest on real data, *does* have the property of keeping actual ampersands in the output, and is certainly among the most readable of the answers given. — Peter Hansen, Dec 20 '09 at 21:51
Peter: The reason for that is that I had not yet read your comment on why I was receiving the errors I was, and wasn't able to find a solution by the time I needed to write this code. Now that I see your comments, it's likely that I'll switch the code to use your faster, more readable solution. — Mike Trpcic, Dec 20 '09 at 23:07

score 2 · Accepted Answer · answered Dec 20 '09 at 20:20


You could use a negative lookbehind (assuming the regex engine in question supports it) to only match ampersands that do not follow another ampersand.

/(?<!&)&/
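
As a minimal sketch of how this pattern could plug into the split-and-join code from the question: Python's re module supports lookbehind, so the pattern can be passed to re.split directly. The names `codes` and `colorize` are illustrative and do not come from the original posts.

```python
import re

# Color codes from the question, plus "&" so that "&&" escapes to a literal "&".
codes = {"y": "\033[0;30m",
         "c": "\033[0;31m",
         "b": "\033[0;32m",
         "Y": "\033[0;33m",
         "u": "\033[0;34m",
         "&": "&"}

def colorize(text):
    # Split on every "&" that is not immediately preceded by another "&".
    parts = re.split(r"(?<!&)&", text)
    # Each part after the first starts with the code character. Unknown codes
    # are put back unchanged; an empty part (a trailing "&") keeps its "&".
    parts[1:] = [codes.get(p[0], "&" + p[0]) + p[1:] if p else "&"
                 for p in parts[1:]]
    return "".join(parts)

print(repr(colorize("&yI &creally &blove A && W &uRootbeer.")))
```

One caveat with this approach: in a run of three or more ampersands only the first one is a split point, so `&&&y` comes out as `&&y` rather than a literal `&` followed by the color code.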


Dav on a Plane


This worked perfectly. I don't know what kind of speed sacrifices I'm making by doing a lookbehind, so if anyone can come up with a more efficient solution (if it even exists), I'll be glad to hear it. – Mike Trpcic Dec 20 '09 at 20:34
As noted above in comment to your question, my solution is actually faster than gnibbler's even before you change it to use a regex split. In any case, I included test code with correct simulated input that should easily let you benchmark the performance change if you stick with this approach. – Peter Hansen Dec 20 '09 at 21:56

score 0 · Answer 2 · answered Dec 20 '09 at 20:21


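Maybe loop while (q = str.find('&', p)) != -1: on each match, append the text to the left of the ampersand (from p up to q) followed by the replacement value, then advance p to q + 2 to skip the ampersand and its code character.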
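
One way to read this suggestion, as a sketch in Python. The original answer gives no code, so the function name `replace_codes`, its `codes` parameter, and the handling of a trailing ampersand are all assumptions made for illustration.

```python
# Color codes from the question, plus "&" so that "&&" escapes to a literal "&".
codes = {"y": "\033[0;30m",
         "c": "\033[0;31m",
         "b": "\033[0;32m",
         "Y": "\033[0;33m",
         "u": "\033[0;34m",
         "&": "&"}

def replace_codes(text, codes):
    out = []
    p = 0  # start of the text that has not been copied yet
    while True:
        q = text.find("&", p)
        if q == -1 or q + 1 >= len(text):
            # No more ampersands, or a trailing "&" with nothing after it.
            break
        out.append(text[p:q])                  # literal text left of the "&"
        key = text[q + 1]
        out.append(codes.get(key, "&" + key))  # replacement, or the pair unchanged
        p = q + 2                              # skip "&" and the code character
    out.append(text[p:])                       # whatever remains after the last code
    return "".join(out)

print(repr(replace_codes("&yI &creally &blove A && W &uRootbeer.", codes)))
```

Because the loop always consumes the ampersand together with the character after it, `&&` is handled by the `"&"` entry in the dictionary without any regex.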


jspcal


score 0 · Answer 3 · answered Dec 20 '09 at 20:33

I think this does the trick:

import re

def fix(text):
    dict = {"y":"\033[0;30m",
            "c":"\033[0;31m",
            "b":"\033[0;32m",
            "Y":"\033[0;33m",
            "u":"\033[0;34m",
            "&":"&"}

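    # Split only on an ampersand that is not preceded by another ampersand,
    # so "&&" leaves a part starting with "&", which the "&" entry maps to "&".
    myparts = re.split(r'(?<!&)&', text)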
    myparts[1:]=[dict.get(x[0],"&"+x[0])+x[1:] if len(x) > 0 else x for x in myparts[1:]]
    result = "".join(myparts)
    return result


print(fix("The &yquick &cbrown &bfox &Yjumps over the &ulazy dog"))
print(fix("&yI &creally &blove A && W &uRootbeer."))

score 0 · Answer 4 · answered Dec 21 '09 at 00:19

re.sub will do what you want. It takes a regex pattern and can take a function that processes each match and returns the replacement. Below, if the character following the & is not in the dictionary, no replacement is made. && is replaced with & so that you can escape an & that is followed by a character in the dictionary.

Also, 'str' and 'dict' are bad variable names because they shadow the built-in types of the same name.

In 's' below, '& cat' will not be affected, and '&&cat' will become "&cat", suppressing the &c translation.

import re

s = "The &yquick &cbrown &bfox & cat &&cat &Yjumps over the &ulazy dog"

D = {"y":"\033[0;30m",
     "c":"\033[0;31m",
     "b":"\033[0;32m",
     "Y":"\033[0;33m",
     "u":"\033[0;34m",
     "&":"&"}

def func(m):
    return D.get(m.group(1),m.group(0))

print(repr(re.sub(r'&(.)', func, s)))

OUTPUT:

'The \x1b[0;30mquick \x1b[0;31mbrown \x1b[0;32mfox & cat &cat \x1b[0;33mjumps over the \x1b[0;34mlazy dog'

-Mark
