
Sorry if this question seems too similar to others I have found. It is a variation on using re.sub to replace exact characters in a string.

I have a string that looks like:

C1([*:5])C([*:6])C2=NC1=C([*:1])C3=C([*:7])C([*:8])=C(N3)C([*:2])=C4C([*:9])=C([*:10])C(=N4)C([*:3])=C5C([*:11])=C([*:12])C(=C2([*:4]))N5

I would like to only replace, for example, the '*:1' with 'Ar'. My current attempt looks like this:

smiles_all='C1([*:5])C([*:6])C2=NC1=C([*:1])C3=C([*:7])C([*:8])=C(N3)C([*:2])=C4C([*:9])=C([*:10])C(=N4)C([*:3])=C5C([*:11])=C([*:12])C(=C2([*:4]))N5'
print(smiles_all)
new_smiles=re.sub('[*:]1','Ar',smiles_all)
print(new_smiles)
C1([*:5])C([*:6])C2=NC1=C([*Ar])C3=C([*:7])C([*:8])=C(N3)C([*:2])=C4C([*:9])=C([*Ar0])C(=N4)C([*:3])=C5C([*Ar1])=C([*Ar2])C(=C2([*:4]))N5

As you can see, this also changes the labels that were previously 10, 11, and 12. I've tried different variations where I select [*:1], but those are also incorrect. In my current output the * also remains; it needs to be removed so that *:1 becomes Ar. Any help here would be greatly appreciated.

Here is an example of what the output should be:

C1([*:5])C([*:6])C2=NC1=C([Ar])C3=C([*:7])C([*:8])=C(N3)C([*:2])=C4C([*:9])=C([*:10])C(=N4)C([*:3])=C5C([*:11])=C([*:12])C(=C2([*:4]))N5

Edit:

At one point this question was flagged as already answered by the question "Escaping regex string". When I use re.escape as suggested, I still get the wrong output:

new_smiles=re.sub(re.escape('*:1'),'Ar',smiles_all)



C1([*:5])C([*:6])C2=NC1=C([*:1])C3=C([*:7])C([*:8])=C(N3)C([*:2])=C4C([*:9])=C([*:10])C(=N4)C([*:3])=C5C([*:11])=C([*:12])C(=C2([*:4]))N5
C1([*:5])C([*:6])C2=NC1=C([Ar])C3=C([*:7])C([*:8])=C(N3)C([*:2])=C4C([*:9])=C([Ar0])C(=N4)C([*:3])=C5C([Ar1])=C([Ar2])C(=C2([*:4]))N5
  • @triplee duplicate is wrong, `*` is in a character class here – Mustafa Aydın Jun 12 '21 at 20:23
  • @Mustafa The question plainly states that they want to match the string verbatim. – tripleee Jun 12 '21 at 20:28
  • The current title of the question is misleading. – Matthias Jun 12 '21 at 20:28
  • @tripleee not the brackets, though. The attempt is to match `*:1`, and the brackets are being used for that purpose. But if you meant the brackets should be removed and therefore `*` should be escaped thereafter, or something else, pardon me. – Mustafa Aydın Jun 12 '21 at 20:41
  • The OP is complaining that `[*:10]` is replaced when it shouldn't be. I continue to fail to see how your interpretation is possible. Granted, the question should probably be clearer, too. – tripleee Jun 12 '21 at 20:44
  • @tripleee Yes, they are unhappy about that, but failing to escape is not the issue. A `+` and a positive lookahead, i.e., `re.sub(r"[*:]+1(?=])", "*Ar", val)`, gives the desired output, no escaping. But I think you meant a solution sans the brackets, hence the need for escaping the asterisk. – Mustafa Aydın Jun 12 '21 at 21:00
  • I've edited it to try to clarify what I want by including an example of the ideal output. – German Barcenas Jun 12 '21 at 21:09
  • @MustafaAydın that still returns the * that I would like removed as well. The brackets need to stay in the string, but the * needs to be removed and replaced by the Ar characters – German Barcenas Jun 12 '21 at 21:16
  • @MustafaAydın You're correct. Sorry, I noticed I had "*Ar" replacing *:, so it was adding an * after I was done. This answers it, thank you! Would you like to add the comment as an answer so I can mark it answered, or should I edit to add the answer? – German Barcenas Jun 12 '21 at 21:21

2 Answers

3

Given:

smiles_all='C1([*:5])C([*:6])C2=NC1=C([*:1])C3=C([*:7])C([*:8])=C(N3)C([*:2])=C4C([*:9])=C([*:10])C(=N4)C([*:3])=C5C([*:11])=C([*:12])C(=C2([*:4]))N5'

desired='C1([*:5])C([*:6])C2=NC1=C([Ar])C3=C([*:7])C([*:8])=C(N3)C([*:2])=C4C([*:9])=C([*:10])C(=N4)C([*:3])=C5C([*:11])=C([*:12])C(=C2([*:4]))N5'

You are trying to replace the literal string [*:1] with [Ar]. In a regex, the expression [*:1] is a character class: it matches exactly one character, which may be *, :, or 1. If you add a quantifier to a character class, it matches a run of those characters in any order, up to the repetition limit.
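As a rough illustration of that point, here is a minimal sketch using a shortened, made-up fragment (`sample` is not the asker's full string). It shows what the pattern from the question and an unescaped `[*:1]` actually match:

```python
import re

# Short hypothetical fragment with one single-digit and one two-digit label.
sample = "C([*:1])C([*:10])"

# '[*:]1' = one character that is '*' or ':', followed by a literal '1'.
# It finds ':1' inside both '[*:1]' and '[*:10]'.
hits = re.findall(r"[*:]1", sample)
print(hits)  # [':1', ':1']

# So the substitution touches both labels and leaves the '*' behind.
replaced = re.sub(r"[*:]1", "Ar", sample)
print(replaced)  # C([*Ar])C([*Ar0])

# '[*:1]' (unescaped) is a class too: every '*', ':' and '1' matches on its own.
singles = re.findall(r"[*:1]", sample)
print(singles)  # ['*', ':', '1', '*', ':', '1']
```

This is the same behaviour seen in the question's output, where `[*:10]` turned into `[*Ar0]`.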

The easiest way to replace the literal [*:1] with [Ar] is to use Python's string methods:

>>> smiles_all.replace('[*:1]','[Ar]')==desired 
True

If you want to use a regex, you need to escape those metacharacters to get a literal string:

>>> import re
>>> re.sub(r'\[\*:1\]', "[Ar]", smiles_all)==desired
True

Or let Python do the escaping for you:

>>> re.sub(re.escape(r'[*:1]'), "[Ar]", smiles_all)==desired
True
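If several labels need swapping, the literal `str.replace` approach can be wrapped in a small helper. The sketch below is only one possible way to do it: the function name `substitute_labels`, the shortened `sample` string, and the `"Me"` replacement are all made up for illustration.

```python
def substitute_labels(smiles, mapping):
    """Replace each literal '[*:N]' placeholder with '[<symbol>]'.

    `mapping` maps a label number to its replacement text (hypothetical API).
    """
    for label, symbol in mapping.items():
        # The brackets are part of the search string, so '[*:1]' can never
        # match inside '[*:10]'.
        smiles = smiles.replace(f"[*:{label}]", f"[{symbol}]")
    return smiles

# Shortened, made-up fragment; "Me" is just an illustrative replacement.
sample = "C([*:1])C([*:2])C([*:10])"
result = substitute_labels(sample, {1: "Ar", 2: "Me"})
print(result)  # C([Ar])C([Me])C([*:10])
```

Because the search text includes the closing bracket, no regex or escaping is needed and the two-digit labels are left alone.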
dawg
  • Although Mustafa did answer, this answer of using .replace seems a little more user friendly and easier to swap out whenever I want to do things like change [*:2], or combinations later. Both are good though. – German Barcenas Jun 12 '21 at 21:26
  • The problem with Mustafa's approach is that it will match *any* combination of `:` and `*` inside a bracket followed by a `1`. See [HERE](https://regex101.com/r/YD2ctU/1). That may not be a problem, but it is not a literal string, which I think is what you were after. – dawg Jun 12 '21 at 21:31
0

You can try:

re.sub(r"[*:]+1(?=])", "Ar", smiles_all)

The difference from your pattern is that [*:]+ allows one or more repetitions of the literal characters * and :, followed by a 1, which in turn must be followed by a ]. That last condition is enforced by the positive lookahead (?=]).

This gives:

"C1([*:5])C([*:6])C2=NC1=C([Ar])C3=C([*:7])C([*:8])=C(N3)C([*:2])=C4C([*:9])=C([*:10])C(=N4)C([*:3])=C5C([*:11])=C([*:12])C(=C2([*:4]))N5"
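Note that this pattern is looser than a literal match, since `[*:]+` accepts any run of `*` and `:`. The sketch below compares it with a stricter lookaround variant on a contrived fragment; `sample` and its `[:*1]` label are made up purely to show the difference and are not valid input from the question.

```python
import re

loose = r"[*:]+1(?=])"           # pattern from this answer
strict = r"(?<=\[)\*:1(?=\])"    # literal '*:1', only when wrapped in brackets

# Contrived fragment: a real label, a two-digit label, and a scrambled one.
sample = "C([*:1])C([*:10])C([:*1])"

loose_result = re.sub(loose, "Ar", sample)
print(loose_result)   # C([Ar])C([*:10])C([Ar])

strict_result = re.sub(strict, "Ar", sample)
print(strict_result)  # C([Ar])C([*:10])C([:*1])
```

Both leave `[*:10]` alone; only the strict version ignores the scrambled `[:*1]`. For strings shaped like the one in the question, the two should behave the same.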
Mustafa Aydın
  • This will also match `[*:*:1]` and `[:*1]`. See [HERE](https://regex101.com/r/YD2ctU/1) – dawg Jun 12 '21 at 21:33
  • @dawg yes, but from the (very strong) pattern of the OP's string, I thought it should do fine. Yours is the way to go, already upvoted `:)`. – Mustafa Aydın Jun 12 '21 at 21:34
  • I agree it is likely fine in this case; I do think it is a good practice to point out a potential gotcha to those just learning regex so that less hair is lost in the world. ;-) – dawg Jun 12 '21 at 21:39
  • It is also super easy to fix: Just use `(?<=\[)[*]:1(?=\])` or `(?<=\[)\*:1(?=\])` and then the proper literal string of `'[*:1]'` is matched without repetition. Faster and more accurate... – dawg Jun 12 '21 at 21:53