How to use re.sub(), or similar, to do replacements and generate raw strings without the metacharacters causing problems with the regex engine?

Question

import re

personal_pronoun = "se les"   #example 1
personal_pronoun = "se le"    #example 2
personal_pronoun = "se   le"  #example 3
personal_pronoun = "les"      #example 4
personal_pronoun = "le"       #example 5

#re.match() only matches at the beginning of the string
if re.match(r"se", personal_pronoun):
    #prepend a negative lookbehind so that the match becomes conditional
    personal_pronoun_for_regex = re.sub(r"^se", r"(?<!se\s)se", personal_pronoun)
else:
    personal_pronoun_for_regex = personal_pronoun

#re.search() searches for a match anywhere in the string
if re.search(r"\s*le$", personal_pronoun_for_regex):
    #append the \b metacharacter, which represents a word boundary
    personal_pronoun_for_regex = re.sub(r"le$", r"le\b", personal_pronoun_for_regex)

#check what the string looks like before using it in a regex
print(repr(personal_pronoun_for_regex)) # --> output raw string

This code gives me the following error:

raise s.error('bad escape %s' % this, len(this))
re.error: bad escape \s at position 6

How can I get these raw strings into the personal_pronoun_for_regex variable without triggering these re errors?

I think the error comes from one of the re.sub() calls: a re.error is raised, which indicates that there was a problem processing the replacement string.
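That suspicion can be checked in isolation. The sketch below is a minimal reproduction rather than a definitive fix. It assumes Python 3.7 or later, where an unknown escape made of a backslash and an ASCII letter in the replacement argument of re.sub() raises re.error. The variable names doubled, via_function, and with_backspace are illustrative only.

```python
import re

personal_pronoun = "se le"
error_message = None

# The replacement argument is parsed as a template, and "\s" is not a
# valid escape in that template, so this call raises re.error.
try:
    re.sub(r"^se", r"(?<!se\s)se", personal_pronoun)
except re.error as exc:
    error_message = str(exc)
    print(error_message)

# Option 1: double the backslash so the template yields one literal backslash.
doubled = re.sub(r"^se", r"(?<!se\\s)se", personal_pronoun)

# Option 2: pass a function; its return value is inserted as-is,
# without any template processing.
via_function = re.sub(r"^se", lambda m: r"(?<!se\s)se", personal_pronoun)

print(doubled)
print(via_function)

# Related pitfall: "\b" IS a valid template escape, but it means a
# backspace character, so this call silently inserts "\x08".
with_backspace = re.sub(r"le$", r"le\b", personal_pronoun)
print(repr(with_backspace))
```

Both options insert the text `(?<!se\s)se` unchanged. The last call shows why the second re.sub() in the question would not raise an error yet would still not produce a word-boundary token.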

This is what the raw strings should actually look like, so that the special characters are kept literally as part of the regular expression:

personal_pronoun_for_regex = r"se les"           #for example 1
personal_pronoun_for_regex = r"se le\b"          #for example 2
personal_pronoun_for_regex = r"se   le\b"        #for example 3
personal_pronoun_for_regex = r"(?<!se\s)les"     #for example 4
personal_pronoun_for_regex = r"(?<!se\s)le\b"    #for example 5

I told you to use `.replace('\\', '\\\\')`, didn't it work? Just double the backslashes. — Wiktor Stribiżew, Mar 02 '23 at 14:41
@WiktorStribiżew I tried to do that, but the problem is that in the end the raw strings must remain exactly as shown in the expected output of the question; if I do so, I would then have to delete the consecutive characters afterwards — Matt095, Mar 02 '23 at 14:44
The result will be exactly what you need. If you need the original string, use the right variable. — Wiktor Stribiżew, Mar 02 '23 at 14:45
@WiktorStribiżew Sorry, but I think this question was closed too quickly; I even managed to answer the comment you made — Matt095, Mar 02 '23 at 14:46
@WiktorStribiżew Should I remove the raw string prefix `r` before doing that? In lines like this: `personal_pronoun_for_regex = re.sub(r"^se", "(?<!se\\s)se", personal_pronoun)` and like this: `personal_pronoun_for_regex = re.sub(r"le$", r"le\\b", personal_pronoun_for_regex)` — Matt095, Mar 02 '23 at 14:49
As I said, use double backslashes to insert one backslash, it is the third time I repeat it. BUT: this is a general solution. In your code, all you need is `startswith` and `endswith`: https://ideone.com/kulwUu — Wiktor Stribiżew, Mar 02 '23 at 14:51
And stop using `repr(...)` - no idea how many times I repeated it. It confuses you and everyone. Regex is **text**. You must know the exact regex **text** to know what it matches. String literals are for the code, let Python read them. — Wiktor Stribiżew, Mar 02 '23 at 14:51
@WiktorStribiżew I use repr only for testing, as it helps me decide if I need to use a .strip() or if there are \n or \t line breaks in between — Matt095, Mar 02 '23 at 14:52
And then you suddenly copy the string literal and start using it as a real text and then make more mistakes... Anyway, you can use anything in your code, but on SO, it will confuse a lot of users who would want to try and help you. — Wiktor Stribiżew, Mar 02 '23 at 14:54
@WiktorStribiżew I tried your code, but it gives me the wrong output; check this: https://www.online-python.com/ad9rOim5xJ — Matt095, Mar 02 '23 at 14:59
That is because `le` does not start with `se`. Please check your logic. — Wiktor Stribiżew, Mar 02 '23 at 15:19
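
The `startswith`/`endswith` suggestion from the comments can be sketched as follows. This is not the code behind the ideone link, only one possible reading of the idea. The helper name build_pronoun_regex and the constants LOOKBEHIND and WORD_BOUNDARY are hypothetical. The conditions are chosen to reproduce the five expected strings listed in the question: the lookbehind is added when the pronoun does not start with "se", and `\b` is appended when it ends with "le".

```python
import re

LOOKBEHIND = r"(?<!se\s)"   # reject a match preceded by "se" plus one whitespace
WORD_BOUNDARY = r"\b"

def build_pronoun_regex(personal_pronoun):
    """Hypothetical helper: build the regex text with plain string operations."""
    pattern = personal_pronoun
    # No re.sub() is involved, so backslashes are never template-processed.
    if not pattern.startswith("se"):
        pattern = LOOKBEHIND + pattern
    if pattern.endswith("le"):
        pattern = pattern + WORD_BOUNDARY
    return pattern

examples = ["se les", "se le", "se   le", "les", "le"]
results = [build_pronoun_regex(p) for p in examples]
for text in results:
    print(text)  # print the regex text itself, not its repr
```

Because the pattern is assembled by concatenation, the metacharacters reach the regex engine exactly as written. For instance, `re.search(build_pronoun_regex("le"), "se le dio")` finds nothing, while the same pattern matches in `"le dio"`.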
