Python regex: how to achieve this complex replacement rule?

Question

I'm working with long strings and I need to replace with '' every run of adjacent full stops . and/or colons :, but only when the run is not adjacent to any whitespace. Examples:

a.bcd should give abcd
a..::.:::.:bcde.....:fg should give abcdefg
a.b.c.d.e.f.g.h should give abcdefgh
a .b should give a .b, because . here is adjacent to a whitespace on its left, so it must not be replaced
a..::.:::.:bcde.. ...:fg should give abcde.. ...:fg for the same reason

Well, here is what I tried (without any success).

Attempt 1:

s1 = r'a.b.c.d.e.f.g.h'
re.sub(re.search(r'[^\s.:]+([.:]+)[^\s.:]+', s1).group(1), r'', s1)

I would expect to get 'abcdefgh' but what I actually get is ''. I understand why: the code

re.search(r'[^\s.:]+([.:]+)[^\s.:]+', s1).group(1)

returns '.' instead of '\.', so re.sub treats it as the usual regex metacharacter that matches any character, rather than as a literal full stop to replace.

Attempt 2:

s1 = r'a.b.c.d.e.f.g.h'
re.sub(r'([^\s.:]*\S)[.:]+(\S[^\s.:]*)', r'\g<1>\g<2>', s1)

This doesn't work: it returns ab.cd.ef.gh, removing only every other full stop.

Attempt 3:

s1 = r'a.b.c.d.e.f.g.h'
re.sub(r'([^\s.:]*)[.:]+([^\s.:]*)', r'\g<1>\g<2>', s1)

This works on s1, but it doesn't solve my problem because on s2 = r'a .b' it returns a b rather than a .b.

Any suggestion?

Remember that a backslash is special both in strings and in regular expressions, so you need to escape it properly. — Some programmer dude, Mar 14 '18 at 12:40
Note that you don't need to escape `.` inside a character class. — 0x5453, Mar 14 '18 at 12:42
I can't simply do `re.sub('\\\.', '', s1)`, I updated the question in order to explain why. Many thanks and sorry. — Vanni Rovera, Mar 14 '18 at 13:06
@AvinashRaj your solution doesn't solve my problem. If I understand well, my problem is that `re.search` returns `'.'`, which without the backslash means to substitute each character, digit, symbol, etc. I rather need to substitute only the points, so I need `re.search` to return something like `\.`. — Vanni Rovera, Mar 14 '18 at 16:09
@Someprogrammerdude I'm using the backslash exactly because it is special: the point `.` alone would mean to match anything, whereas I need to match points only. So, for all I know, I need to use the backslash `\.` in order to specify this. — Vanni Rovera, Mar 14 '18 at 16:11
Your exposition is riddled by mixing these notations. The *regular expression* `r'\.'` can only match and return the *string* `'.'`. There is no `'\.'` anywhere in `s1` so it is completely impossible that any regular expression could find and return that. — tripleee, Mar 14 '18 at 20:08
I got the point here, the string `\.` doesn't occur in `s1`; only `.` does. What I can't figure out is how to make `re.sub` to replace the point only, without interpreting `.` in the usual sense of regexs. I mean, if I write `re.sub('.', '', s1)` it's obvious that it would have to replace anything with `''`, but I would like to tell it something like "replace `\.`, not `.`". But it seems like this is not possible... if I understand well, to me this looks like a missing feature. I mean, it's completely natural to request something like "find a point and replace it with something else". — Vanni Rovera, Mar 15 '18 at 08:30
`re.sub(r'\.', '', s1)` does what you are asking, in isolation; it replaces a literal dot with nothing. Equivalently, you could say `re.sub(r'[.]', '', s1)` — tripleee, Mar 15 '18 at 08:32
@tripleee of course, this way is trivial. But I need to use a more complex regex in order to achieve the generality I need. See below, under your answer. — Vanni Rovera, Mar 15 '18 at 12:18

tripleee · Accepted Answer · 2018-03-16T06:17:23.057

There are multiple problems here. Your regex doesn't match what you want to match; but also, your understanding of re.sub and re.search is off.

To find something, re.search lets you find where in a string that something occurs.

To replace that something, use re.sub with the same regular expression instead of re.search, not in addition to it.

And, understand that re.sub(r'thing(moo)other', '', s1) replaces the entire match with the replacement string.
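A minimal sketch of the three points above, using the string from Attempt 1 in the question. The variable names (found, emptied, literal, whole) are illustrative only, and re.escape is one possible way to turn a found string into a literal pattern, not something the question used.

```python
import re

s1 = 'a.b.c.d.e.f.g.h'

# re.search only locates a match; group(1) is the plain string '.',
# not a regular expression for a literal dot.
found = re.search(r'[^\s.:]+([.:]+)[^\s.:]+', s1).group(1)
print(repr(found))    # '.'

# Used as a pattern, '.' matches any character, so everything is removed.
emptied = re.sub(found, '', s1)
print(repr(emptied))  # ''

# re.escape builds a pattern that matches the found string literally.
literal = re.sub(re.escape(found), '', s1)
print(repr(literal))  # 'abcdefgh'

# re.sub replaces the entire match, not only the parenthesized group.
whole = re.sub(r'thing(moo)other', '', 'A' + 'thing' + 'moo' + 'other' + 'B')
print(repr(whole))    # 'AB'
```

The last call shows why a capturing group alone does not limit what gets replaced: the group only matters if the replacement string refers back to it.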

With that out of the way, for your regex, it sounds like you want

r'(?<![\s.:])[.:]+(?![\s.:])'   # updated from comments, thanks!

which contains a character class with full stop and colon (notice how no backslash is necessary inside the square brackets: this is a context where dot and colon do not have any special meaning¹), repeated as many times as possible, with lookarounds on both sides to say we cannot match these characters when there is whitespace \s on either side. The lookarounds also exclude the characters themselves, so that there is no way for the regex engine to find a match by applying the + less greedily (it will do its darndest to find a match if there is a way).
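As a quick check, the pattern can be run against the five examples from the question. This is only a sketch; the names PATTERN and examples are made up for the illustration.

```python
import re

# The lookaround pattern from the answer, compiled once for reuse.
PATTERN = re.compile(r'(?<![\s.:])[.:]+(?![\s.:])')

# Input -> expected output, copied from the examples in the question.
examples = {
    'a.bcd': 'abcd',
    'a..::.:::.:bcde.....:fg': 'abcdefg',
    'a.b.c.d.e.f.g.h': 'abcdefgh',
    'a .b': 'a .b',
    'a..::.:::.:bcde.. ...:fg': 'abcde.. ...:fg',
}

for text, expected in examples.items():
    result = PATTERN.sub('', text)
    status = 'ok' if result == expected else 'MISMATCH'
    print(f'{text!r} -> {result!r} ({status})')
```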

Now, the regex only matches the part you want to actually replace, so you can do

>>> import re
>>> s1 = 'name.surname@domain.com'
>>> re.sub(r'(?<![\s.:])[.:]+(?![\s.:])', r'', s1)
'namesurname@domaincom'

though in the broader scheme of things, you also need to know how to preserve some parts of the match. For the purpose of this demonstration, I will use a regular expression which captures into parenthesized groups the text before and after the dot or colon:

>>> re.sub(r'(.*\S)[.:]+(\S.*)', r'\g<1>\g<2>', s1)
'name.surname@domaincom'

See how \g<1> in the replacement string refers back to "whatever the first set of parentheses matched" and similarly \g<2> to the second parenthesized group.

You will also notice that this failed to replace the first full stop, because the .* inside the first set of parentheses matches as much of the string as possible. To avoid this, you need a regex which only matches as little as possible. We already solved that above with the lookarounds, so I will leave you here, though it would be interesting (and yet not too hard) to solve this in a different way.
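One possible "different way", sketched here as an assumption rather than as part of the original answer: capture the single character before the run, and check the character after it with a lookahead so that it is not consumed and stays available as the left-hand character of the next match. The names ALT and strip_inner_punct are hypothetical.

```python
import re

# Hypothetical alternative pattern: one captured non-space, non-punctuation
# character, then the run to drop, then a lookahead for another such character.
ALT = re.compile(r'([^\s.:])[.:]+(?=[^\s.:])')

def strip_inner_punct(text):
    """Remove runs of '.' and ':' that sit between two ordinary characters."""
    # \g<1> puts back the captured character; only the run itself is dropped.
    return ALT.sub(r'\g<1>', text)

print(strip_inner_punct('a.b.c.d.e.f.g.h'))           # abcdefgh
print(strip_inner_punct('a .b'))                      # a .b
print(strip_inner_punct('a..::.:::.:bcde.. ...:fg'))  # abcde.. ...:fg
```

This version behaves differently from the lookaround pattern at the very start or end of the string: it leaves '.abc' and 'abc.' unchanged because it requires a real character on both sides, whereas the negative lookarounds also succeed where there is no character at all.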

¹ You could even say that the normal regex language (or syntax, or notation, or formalism) is separate from the language (or syntax, or notation, or formalism) inside square brackets!
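A small sketch of what the footnote means in practice, comparing a dot outside and inside square brackets. The sample string and result names are arbitrary.

```python
import re

s = 'a.b:c'

any_char = re.findall(r'.', s)      # unescaped dot: matches every character
escaped = re.findall(r'\.', s)      # escaped dot: matches a literal full stop
in_class = re.findall(r'[.]', s)    # dot inside brackets: already literal
both = re.findall(r'[.:]', s)       # full stop or colon, no escaping needed

print(any_char)  # ['a', '.', 'b', ':', 'c']
print(escaped)   # ['.']
print(in_class)  # ['.']
print(both)      # ['.', ':']
```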

edited Mar 16 '18 at 06:17

answered Mar 14 '18 at 13:22

tripleee


Thank you for your detailed answer, there are a lot of things I wasn't aware of. Still it isn't clear to me how `\g<1>\g<2>` works. I tried with `s2 = a.b.c.d.e.f.g.h` and `re.sub(r'(.*\S)[.:]+(\S.*)', r'\g<1>\g<2>', s2)` returns `a.b.c.d.e.f.gh`, as I expected based on what you said, but `re.sub(r'([^\s.:]*\S)[.:]+(\S[^\s.:]*)', r'\g<1>\g<2>', s2)` returns a very strange `ab.cd.ef.gh`, whereas I expected `abcdefgh`. Why does it neglect the even occurrences of `.`? I read the [documentation](https://docs.python.org/2/library/re.html) but I found nothing useful here. – Vanni Rovera Mar 14 '18 at 17:04
Your regex always matches at least three characters; after a successful match and substitution, the next matching attempt starts immediately after the text you have already matched. – tripleee Mar 14 '18 at 20:03
Wow... I didn't know it reasons this way :( now I'm going to figure out how to fix this. No idea at the moment, I'll let you know asap. Thank you – Vanni Rovera Mar 15 '18 at 08:10
It's not hard, `r'([^\s.:]*)[.:]+([^\s.:]*)'` without the `\S` inside the parentheses should do what you hope. – tripleee Mar 15 '18 at 08:31
No, this doesn't yield what I would like to achieve. I immediately drop this solution because it removes `.` even when it confines with a white space. That is, it works fine with `a.b` yielding `ab`, but it fails with `a .b` as it gives `a b`, whereas I would like `a .b`. I have to remove `.` and `:` only when they are not adjacent to any white space. This is why I have no idea at the moment, and this is why I am still shocked how `re.sub` works with matches of groups. – Vanni Rovera Mar 15 '18 at 12:16
I'm sorry, I don't think I understand where you are stuck. Perhaps you want to accept this answer and post a new question with more details; or maybe [edit] your question and ping me, and I'll try to update my answer. (I think the former is better, though.) – tripleee Mar 15 '18 at 12:21
I updated the question as the previous one was no longer significant. I hope the question looks clearer now. – Vanni Rovera Mar 15 '18 at 12:48
@VanniRovera It's a lot clearer now but I don't understand why the lookarounds I proposed as my actual answer are not working, or did you not try that? – tripleee Mar 15 '18 at 13:04
...I totally forgot this one. It's close to perfection, nonetheless it fails with `a..::.:::.:bcde.. ...:fg`: this yields `abcde. .fg` rather than `abcde.. ...:fg` (see the question). So I used `r'(?<![\s.:])[.:]+(?![\s.:])'` instead and this worked perfectly. Thank you very much. I just suggest you to include this modified version in your answer, so that anyone can see the right solution. – Vanni Rovera Mar 15 '18 at 23:39
