Python: Replace string parts that DO NOT match specific regex

Question

I need to URL-encode the parts of a string that do not match a regex. My current solution (below) is to:

1. select the parts that match my regex (##.*##)
2. put the found substrings in a list and replace them with placeholder indexes that encoding leaves untouched, such as ~~1~~
3. encode everything (the entire URL)
4. put back the elements I found

This code works, but I'm sure it could be done better, with a single pass that looks for the parts of the string that do not match my regex. Doing it this way every time adds a huge overhead.

import re
from itertools import count
import urllib.parse

def replace_parts(url):
    parts = []
    counter = count(0)
    def replace_to(match):
        match = match.group(0)
        parts.append(match)
        return '~~' + str(next(counter)) + '~~'
        
    def replace_from(match):
        return parts[next(counter)]
    
    url = re.sub(r'##(.*?)##', replace_to, url)
    url = urllib.parse.quote(url)

    counter = count(0)
    url = re.sub(r'~~([0-9]+)~~', replace_from, url)
    return url

url1 = "http://google.com?this_is_my_encodedurl##somethin##&email=##other##tr"
url = replace_parts(url1)
print(url)
# this becomes http%3A%2F%2Fgoogle.com%3Fthis_is_my_encodedurl##somethin##%0A%26email%3D##other##tr

You should also include what the URL looks like after the replacements. — Tim Biegeleisen, Oct 26 '22 at 14:14
You can use `re.split` to split the string by matches and then replace things in between matches. — matszwecja, Oct 26 '22 at 14:20
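
The `re.split` suggestion in the comment above could look roughly like the following. This is a minimal sketch, not code from the thread: the names `PLACEHOLDER` and `replace_parts_split` are hypothetical, and it assumes the default `safe='/'` of `urllib.parse.quote`.

```python
import re
import urllib.parse

# The capturing group makes re.split keep the ##...## parts in its result.
PLACEHOLDER = re.compile(r'(##.*?##)')

def replace_parts_split(url):
    pieces = PLACEHOLDER.split(url)
    # With exactly one capturing group, even indexes hold the text between
    # matches and odd indexes hold the matched ##...## parts.
    return ''.join(
        piece if i % 2 else urllib.parse.quote(piece)
        for i, piece in enumerate(pieces)
    )

sample = "http://google.com?this_is_my_encodedurl##somethin##&email=##other##tr"
print(replace_parts_split(sample))
# http%3A//google.com%3Fthis_is_my_encodedurl##somethin##%26email%3D##other##tr
```

Only the pieces at even indexes are encoded, so the string is scanned once by the regex and each non-matching piece is quoted once.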

score 1 · Answer 1 · answered Oct 26 '22 at 14:48

You could use re.sub to match the ##.*?## pattern, but also the text that preceded it, so that you have both categories of text as a pair. Then apply the URL encoding only on the first part in the callback function. To deal with the ending of the input, allow the second part to be either the ##.*?## pattern or the end of the input ($):

def replace_parts(url):
    return re.sub(r'(.*?)(##.*?##|$)', 
                  lambda m: urllib.parse.quote(m[1]) + m[2], 
                  url)
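
As a self-contained check, the function above can be run on the sample URL from the question. The sketch below also illustrates one caveat that may or may not matter for your input: `.` does not match a newline by default, so text before an embedded newline is skipped unless `re.DOTALL` is passed. The `flags` parameter is a hypothetical addition for this illustration, not part of the answer.

```python
import re
import urllib.parse

def replace_parts(url, flags=0):
    # Same pattern as in the answer; `flags` is an illustrative extra parameter.
    return re.sub(r'(.*?)(##.*?##|$)',
                  lambda m: urllib.parse.quote(m[1]) + m[2],
                  url, flags=flags)

sample = "http://google.com?this_is_my_encodedurl##somethin##&email=##other##tr"
print(replace_parts(sample))
# http%3A//google.com%3Fthis_is_my_encodedurl##somethin##%26email%3D##other##tr

multiline = "a b\nc d##x##e"
print(repr(replace_parts(multiline)))             # 'a b\nc%20d##x##e'
print(repr(replace_parts(multiline, re.DOTALL)))  # 'a%20b%0Ac%20d##x##e'
```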

answered Oct 26 '22 at 14:48

trincot


The pattern itself could be compiled using `re.compile` and added to the function: `replace_parts.pattern = re.compile(...)`, then use the pattern in the function. – Gábor Fekete Oct 26 '22 at 14:53
@GáborFekete, sure, but see [this answer](https://stackoverflow.com/a/452143/5459839) which states that *"Python internally compiles AND CACHES regexes"*. – trincot Oct 26 '22 at 15:00
maybe, but it's still a good practice to compile them, as I see there is some maximum limit of the cache, so you are better off compiling them anyway. – Gábor Fekete Oct 26 '22 at 15:03
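
For illustration, the precompiled variant discussed in these comments might look like the sketch below. It uses a module-level constant rather than a function attribute, and the name `_PARTS_RE` is a hypothetical choice. Since `re` caches compiled patterns, as noted above, any speed gain is probably small; the main benefit is that the pattern is named and defined once.

```python
import re
import urllib.parse

# Compiled once at import time; hypothetical name, same pattern as the answer.
_PARTS_RE = re.compile(r'(.*?)(##.*?##|$)')

def replace_parts(url):
    # Encode the text before each ##...## part (or before the end of input)
    # and copy the ##...## part through unchanged.
    return _PARTS_RE.sub(lambda m: urllib.parse.quote(m[1]) + m[2], url)

print(replace_parts("http://google.com?this_is_my_encodedurl##somethin##&email=##other##tr"))
# http%3A//google.com%3Fthis_is_my_encodedurl##somethin##%26email%3D##other##tr
```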

score 0 · Answer 2 · answered Oct 26 '22 at 15:42

Another option is to use re.sub with a lambda, combining a capture group with an alternation.

In the lambda, check whether capture group 1 exists. If it does, apply urllib.parse.quote to it and return the result. If there is no group 1, return the match unchanged.

See a regex demo for the matches and groups.

The pattern matches:

- `##\S*?##` Match as few non-whitespace chars as possible between `##` and `##`
- `|` Or
- `((?:(?!##.*?##)\S)+)` Capture in group 1 one or more non-whitespace chars, checking before each char that a `##...##` part does not start at that position

Example

import re
import urllib.parse

pattern = r"##\S*?##|((?:(?!##.*?##)\S)+)"

def replace_parts(url):
    return re.sub(
        pattern,
        lambda m: urllib.parse.quote(m[1]) if m[1] else m[0],
        url
    )


s = "http://google.com?this_is_my_encodedurl##somethin##&email=##other##tr"
print(replace_parts(s))

Output

http%3A//google.com%3Fthis_is_my_encodedurl##somethin##%26email%3D##other##tr
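
The output above keeps `/` as is, because `urllib.parse.quote` treats `/` as safe by default, whereas the output stated in the question shows `%2F`. If that form is what you need, one possible adjustment is to pass `safe=''`. The sketch below assumes that is the intent; the `safe` parameter on `replace_parts` is a hypothetical addition. Note also that this pattern only matches non-whitespace chars, so whitespace is passed through unencoded.

```python
import re
import urllib.parse

pattern = r"##\S*?##|((?:(?!##.*?##)\S)+)"

def replace_parts(url, safe=''):
    # `safe` is forwarded to quote(); an empty string also encodes "/".
    return re.sub(
        pattern,
        lambda m: urllib.parse.quote(m[1], safe=safe) if m[1] else m[0],
        url
    )

s = "http://google.com?this_is_my_encodedurl##somethin##&email=##other##tr"
print(replace_parts(s))
# http%3A%2F%2Fgoogle.com%3Fthis_is_my_encodedurl##somethin##%26email%3D##other##tr
print(replace_parts(s, safe='/'))
# http%3A//google.com%3Fthis_is_my_encodedurl##somethin##%26email%3D##other##tr
```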
