Turning a string of chemical elements into a list containing the separate atoms

Question

I'm trying to create a list from a string that contains a chemical formula as in the following example:

structuralFormula1 = 'OCaOSeOO'

In the list I want all the chemical atoms to be separated from each other, like this:

structuralFormula1_list = ['O', 'Ca', 'O', 'Se', 'O', 'O']

I have no idea how to begin solving this problem. Any tips would be very much appreciated! Thanks in advance :)

Just do a regex findall on a capital letter followed by an optional lowercase letter. – Sayse, Nov 26 '21 at 10:38
Does https://stackoverflow.com/questions/62296317/python-split-string-by-multiple-specific-characters-and-keep-the-delimiters answer your question? – Dani Mesejo, Nov 26 '21 at 10:40

Ashok Arora · Accepted Answer · 2021-11-26T12:56:38.183

3

Using re.findall:

>>> import re
>>> ex = 'OCaOSeOO'
>>> re.findall('[A-Z][a-z]?', ex)
['O', 'Ca', 'O', 'Se', 'O', 'O']

Regex explanation:

[A-Z] : Match uppercase letter
[a-z] : Match lowercase letter
[A-Z][a-z] : Match uppercase letter followed by lowercase letter
[A-Z][a-z]? : Match uppercase letter followed by 0 or 1 lowercase letters
[A-Z][a-z]* : Match uppercase letter followed by 0 or more lowercase letters. (* is greedy, it will match as many as found)

>>> import re
>>> ex = 'OCaOSeOO'
>>> re.findall('[A-Z]', ex)
['O', 'C', 'O', 'S', 'O', 'O']
>>> re.findall('[A-Z][a-z]', ex)
['Ca', 'Se']
>>> re.findall('[A-Z][a-z]?', ex)
['O', 'Ca', 'O', 'Se', 'O', 'O']

Edit #1: For another example mentioned in the comments:

>>> ex = 'ABbCccDdddEeeeeFfffffGggggggg'
>>> re.findall('[A-Z][a-z]*', ex)
['A', 'Bb', 'Ccc', 'Dddd', 'Eeeee', 'Ffffff', 'Gggggggg']
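
One thing worth knowing about re.findall is that it silently skips any characters that do not match the pattern, so a typo in the formula would go unnoticed. A minimal sketch of a wrapper that checks the whole string first is shown below. The name split_formula is hypothetical, and the pattern assumes symbols of one uppercase letter plus at most one lowercase letter, as in the original example.

```python
import re

# Assumption: each symbol is one uppercase letter, optionally followed
# by one lowercase letter (e.g. 'O', 'Ca', 'Se').
SYMBOL = r'[A-Z][a-z]?'


def split_formula(formula):
    """Return the list of symbols, or raise if anything else is present."""
    # fullmatch succeeds only if the entire string is a run of symbols.
    if not re.fullmatch(f'(?:{SYMBOL})+', formula):
        raise ValueError(f'unexpected characters in formula: {formula!r}')
    return re.findall(SYMBOL, formula)


print(split_formula('OCaOSeOO'))  # ['O', 'Ca', 'O', 'Se', 'O', 'O']
```

With this check, an input such as 'oCaO' raises ValueError instead of quietly dropping the leading 'o'.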

edited Nov 26 '21 at 12:56

answered Nov 26 '21 at 10:43

Ashok Arora


Thank you so much for this explanation. I had never heard about this "re" module. I will try to incorporate your code into mine and see if it works. – Fedra Herman Nov 26 '21 at 12:07
What if I had chemical elements consisting of 3 letters? How can I change the number of letters of each atom in my list? For example, with ex = 'ABbCccDdddEeeeeFfffffGggggggg' in the code you wrote, the answer would need to be {'A': 1, 'Bb': 1, 'Ccc': 1, 'Dddd': 1, 'Eeeee': 1, 'Ffffff': 1, 'Gggggggg': 1}. Is this something I can also include in the code? – Fedra Herman Nov 26 '21 at 12:30

@FedraHerman I have edited the answer to include the example that you mentioned. Notice the `*` and `?` characters. I have also added explanation for the same. – Ashok Arora Nov 26 '21 at 12:57
Thanks for your help. Your code seems to do the trick. I will incorporate it into my own code :) – Fedra Herman Nov 27 '21 at 12:34
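
The comment above asks for a dictionary of counts rather than a list. One possible way to get there, sketched here with collections.Counter from the standard library, is to count the items that re.findall returns. The function name count_atoms is made up for this illustration and does not appear in the answer.

```python
import re
from collections import Counter


def count_atoms(formula):
    """Count each symbol: an uppercase letter plus any lowercase letters."""
    symbols = re.findall(r'[A-Z][a-z]*', formula)
    # Counter tallies the symbols; dict() gives a plain dictionary.
    return dict(Counter(symbols))


print(count_atoms('OCaOSeOO'))  # {'O': 4, 'Ca': 1, 'Se': 1}
```

For 'ABbCccDdddEeeeeFfffffGggggggg' every symbol occurs once, so each value in the resulting dictionary is 1.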

score 1 · Answer 2 · answered Nov 26 '21 at 10:45


Use re.findall:

import re
print(re.findall(r'.[^A-Z]*', 'OCaOSeOO'))

Output:

['O', 'Ca', 'O', 'Se', 'O', 'O']
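
Note that '.[^A-Z]*' is not quite the same as '[A-Z][a-z]?': it takes any character and then everything up to the next uppercase letter, so digits and other characters stay attached to the preceding symbol. A small comparison, using 'H2O' as an illustrative input that is not from the question:

```python
import re

formula = 'H2O'  # illustrative input, not from the original question

# Uppercase letter plus optional lowercase letter: the digit is skipped.
by_symbol = re.findall(r'[A-Z][a-z]?', formula)

# Any character plus everything up to the next uppercase letter:
# the digit stays attached to the 'H'.
by_chunk = re.findall(r'.[^A-Z]*', formula)

print(by_symbol)  # ['H', 'O']
print(by_chunk)   # ['H2', 'O']
```

Which behaviour is preferable depends on whether the formulas can contain anything besides letters.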


Yefet


score 0 · Answer 3 · answered Nov 26 '21 at 10:42

You can use re.split with lookarounds:

import re
structuralFormula1 = 'OCaOSeOO'
print(re.split(r'(?<!^)(?=[A-Z])', structuralFormula1))

It splits at every position that is:

Not preceded by the beginning of the string: (?<!^) => this avoids a leading empty string in the result
Followed by an uppercase letter: (?=[A-Z])

Output:

['O', 'Ca', 'O', 'Se', 'O', 'O']
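
To see what the (?<!^) part buys you, here is a short side-by-side sketch of the same split with and without it. Splitting on a zero-width match like this needs Python 3.7 or later. The variable names are only for illustration.

```python
import re

formula = 'OCaOSeOO'

# Lookahead only: the zero-width match at position 0 also splits,
# which leaves an empty string at the front of the result.
without_guard = re.split(r'(?=[A-Z])', formula)

# With the lookbehind, position 0 is excluded from the split points.
with_guard = re.split(r'(?<!^)(?=[A-Z])', formula)

print(without_guard)  # ['', 'O', 'Ca', 'O', 'Se', 'O', 'O']
print(with_guard)     # ['O', 'Ca', 'O', 'Se', 'O', 'O']
```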
