
I need to URL-encode the parts of a string that do not match a regex. My current solution (below) is to:

  1. select the substrings that match my regex (##.*?##)
  2. store the matched substrings in a list and replace each one with a placeholder index that encoding leaves untouched, e.g. ~~1~~
  3. encode everything (the entire URL)
  4. put the stored substrings back

I have this code that works, but I'm sure it could be done better, with a single pass that looks for the parts of the string not matching my regex. Doing it this way every time adds a lot of overhead.

import re
from itertools import count
import urllib.parse

def replace_parts(url):
    parts = []
    counter = count(0)
    def replace_to(match):
        match = match.group(0)
        parts.append(match)
        return '~~' + str(next(counter)) + '~~'
        
    def replace_from(match):
        return parts[next(counter)]
    
    url = re.sub(r'##(.*?)##', replace_to, url)
    url = urllib.parse.quote(url)

    counter = count(0)
    url = re.sub(r'~~([0-9]+)~~', replace_from, url)
    return url

url1 = "http://google.com?this_is_my_encodedurl##somethin##&email=##other##tr"
url = replace_parts(url1)
print(url)
# this becomes http%3A//google.com%3Fthis_is_my_encodedurl##somethin##%26email%3D##other##tr
Alexandru R

2 Answers

1

You could use re.sub to match not only the ##.*?## pattern but also the text that precedes it, so that each match gives you both categories of text as a pair. Then apply the URL encoding only to the first part in the callback function. To handle the end of the input, allow the second part to be either the ##.*?## pattern or the end of the input ($):

def replace_parts(url):
    return re.sub(r'(.*?)(##.*?##|$)', 
                  lambda m: urllib.parse.quote(m[1]) + m[2], 
                  url)
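
As a quick check, the function above can be run on the sample URL from the question. The sketch below repeats the function so that it runs on its own; the variable names `url1` and `result` are only for illustration, and the expected value assumes a recent Python 3, where `urllib.parse.quote` leaves `/` unescaped by default.

```python
import re
import urllib.parse

def replace_parts(url):
    # Group 1: text before a marker (gets encoded).
    # Group 2: a ##...## marker, or the end of the input (kept as-is).
    return re.sub(r'(.*?)(##.*?##|$)',
                  lambda m: urllib.parse.quote(m[1]) + m[2],
                  url)

url1 = "http://google.com?this_is_my_encodedurl##somethin##&email=##other##tr"
result = replace_parts(url1)
print(result)
# http%3A//google.com%3Fthis_is_my_encodedurl##somethin##%26email%3D##other##tr
```

The trailing "tr" is handled by the `$` alternative: it becomes group 1 of a final match whose second group is empty.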
trincot
  • The pattern itself could be compiled using `re.compile` and added to the function: `replace_parts.pattern = re.compile(...)`, then use the pattern in the function. – Gábor Fekete Oct 26 '22 at 14:53
  • @GáborFekete, sure, but see [this answer](https://stackoverflow.com/a/452143/5459839) which states that *"Python internally compiles AND CACHES regexes"*. – trincot Oct 26 '22 at 15:00
  • maybe, but it's still a good practice to compile them, as I see there is some maximum limit of the cache, so you are better off compiling them anyway. – Gábor Fekete Oct 26 '22 at 15:03
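
For reference, the precompiled variant discussed in these comments might look roughly like the sketch below. The module-level name `PARTS_PATTERN` is an assumption made for illustration (the comment suggests attaching the pattern to the function instead); the behaviour should be the same as the `re.sub` call in the answer.

```python
import re
import urllib.parse

# Hypothetical name: the pattern is compiled once, when the module is imported.
PARTS_PATTERN = re.compile(r'(.*?)(##.*?##|$)')

def replace_parts(url):
    # Encode the text before each ##...## marker; keep the marker itself unchanged.
    return PARTS_PATTERN.sub(
        lambda m: urllib.parse.quote(m[1]) + m[2],
        url)

print(replace_parts("a b##x y##c d"))
# a%20b##x y##c%20d
```

Whether this is measurably faster than relying on the internal cache of the `re` module would need to be timed for the workload in question.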
0

Another option is to use re.sub with a lambda, together with a pattern that combines an alternation and a capture group.

In the lambda, check whether capture group 1 has a value. If it does, apply urllib.parse.quote to it and return the result. If there is no group 1, return the full match unchanged.

See a regex demo for the matches and groups.

The pattern matches

  • ##\S*?## Match as few non-whitespace chars as possible between ##
  • | Or
  • ((?:(?!##.*?##)\S)+) Capture in group 1 one or more non-whitespace chars, none of which is at the start of a ##...## block
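
To see the matches and groups without an online tester, the pattern can be walked with `re.finditer`. This is a minimal sketch; the names `s` and `matches` are only for illustration.

```python
import re

pattern = r"##\S*?##|((?:(?!##.*?##)\S)+)"
s = "http://google.com?this_is_my_encodedurl##somethin##&email=##other##tr"

# Pair each full match with group 1 (None when the ##...## alternative matched).
matches = [(m[0], m[1]) for m in re.finditer(pattern, s)]
for full, group1 in matches:
    print(repr(full), repr(group1))
# 'http://google.com?this_is_my_encodedurl' 'http://google.com?this_is_my_encodedurl'
# '##somethin##' None
# '&email=' '&email='
# '##other##' None
# 'tr' 'tr'
```

Group 1 is set only for the parts that should be encoded, which is what the lambda relies on.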

Example

import re
import urllib.parse

pattern = r"##\S*?##|((?:(?!##.*?##)\S)+)"

def replace_parts(url):
    return re.sub(
        pattern,
        lambda m: urllib.parse.quote(m[1]) if m[1] else m[0],
        url
    )


s = "http://google.com?this_is_my_encodedurl##somethin##&email=##other##tr"
print(replace_parts(s))

Output

http%3A//google.com%3Fthis_is_my_encodedurl##somethin##%26email%3D##other##tr
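
The output above keeps `//` unescaped because `urllib.parse.quote` treats `/` as safe by default (`safe='/'`). If slashes should be percent-encoded as well, one option is to pass `safe=""`. The sketch below assumes that is the desired behaviour; the function name `replace_parts_strict` is hypothetical.

```python
import re
import urllib.parse

pattern = r"##\S*?##|((?:(?!##.*?##)\S)+)"

def replace_parts_strict(url):
    # Same approach as above, but safe="" makes quote() encode "/" too.
    return re.sub(
        pattern,
        lambda m: urllib.parse.quote(m[1], safe="") if m[1] else m[0],
        url
    )

s = "http://google.com?this_is_my_encodedurl##somethin##&email=##other##tr"
strict = replace_parts_strict(s)
print(strict)
# http%3A%2F%2Fgoogle.com%3Fthis_is_my_encodedurl##somethin##%26email%3D##other##tr
```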
The fourth bird