I'm trying to create a list from a string that contains a chemical formula as in the following example:

structuralFormula1 = 'OCaOSeOO'

In the list I want all the chemical atoms to be separated from each other, like this:

structuralFormula1_list = ['O', 'Ca', 'O', 'Se', 'O', 'O']

I have no idea how to begin solving this problem. Any tips would be very much appreciated! Thanks in advance :)

Tranbi
  • just do a regex findall on capital letter, optional lower case letter – Sayse Nov 26 '21 at 10:38
  • Does https://stackoverflow.com/questions/62296317/python-split-string-by-multiple-specific-characters-and-keep-the-delimiters answer your question? – Dani Mesejo Nov 26 '21 at 10:40

3 Answers

Using re.findall:

>>> import re
>>> ex = 'OCaOSeOO'
>>> re.findall('[A-Z][a-z]?', ex)
['O', 'Ca', 'O', 'Se', 'O', 'O']

Regex explanation:

  • [A-Z] : Match uppercase letter
  • [a-z] : Match lowercase letter
  • [A-Z][a-z] : Match uppercase letter followed by lowercase letter
  • [A-Z][a-z]? : Match uppercase letter followed by 0 or 1 lowercase letters
  • [A-Z][a-z]* : Match uppercase letter followed by 0 or more lowercase letters (* is greedy: it matches as many lowercase letters as it can)

Comparing the patterns on the example string:

>>> import re
>>> ex = 'OCaOSeOO'
>>> re.findall('[A-Z]', ex)
['O', 'C', 'O', 'S', 'O', 'O']
>>> re.findall('[A-Z][a-z]', ex)
['Ca', 'Se']
>>> re.findall('[A-Z][a-z]?', ex)
['O', 'Ca', 'O', 'Se', 'O', 'O']
>>>

Edit #1: For another example mentioned in the comments:

>>> ex = 'ABbCccDdddEeeeeFfffffGggggggg'
>>> re.findall('[A-Z][a-z]*', ex)
['A', 'Bb', 'Ccc', 'Dddd', 'Eeeee', 'Ffffff', 'Gggggggg']
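For the per-element counts asked about in the comments, one possible sketch (not part of the original answer) is to pass the findall result to collections.Counter. The helper name count_elements is made up for illustration, and the pattern assumes every symbol is one uppercase letter followed by any number of lowercase letters.

```python
import re
from collections import Counter

def count_elements(formula):
    """Hypothetical helper: count each element symbol in a formula string."""
    # Assumes a symbol is one uppercase letter plus zero or more lowercase letters.
    symbols = re.findall(r'[A-Z][a-z]*', formula)
    return dict(Counter(symbols))

print(count_elements('OCaOSeOO'))
# {'O': 4, 'Ca': 1, 'Se': 1}
```

If elements should have at most two letters, swapping `*` for `?` in the pattern should be enough.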
Ashok Arora
  • Thank you so much for this explanation. I had never heard about this "re" module. I will try to incorporate your code into mine and see if it works. – Fedra Herman Nov 26 '21 at 12:07
  • What if I had chemical elements consisting of 3 letters? How can I change the number of letters allowed for each atom in my list? For example, with ex = 'ABbCccDdddEeeeeFfffffGggggggg' in the code you wrote, the answer would need to be {'A': 1, 'Bb': 1, 'Ccc': 1, 'Dddd': 1, 'Eeeee': 1, 'Ffffff': 1, 'Gggggggg': 1}. Is this something I can also include in the code? – Fedra Herman Nov 26 '21 at 12:30
  • @FedraHerman I have edited the answer to include the example that you mentioned. Notice the `*` and `?` characters. I have also added explanation for the same. – Ashok Arora Nov 26 '21 at 12:57
  • Thanks for your help. Your code seems to do the trick. I will incorporate it into my own code :) – Fedra Herman Nov 27 '21 at 12:34

Use re.findall:

import re
re.findall('.[^A-Z]*', 'OCaOSeOO')

Output:

['O', 'Ca', 'O', 'Se', 'O', 'O']
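
A side note on how this pattern behaves, as a sketch rather than part of the original answer: `.` accepts any first character and `[^A-Z]*` then takes everything up to the next uppercase letter, so digits or other non-letters stay attached to the preceding symbol. The input 'H2O' is an assumed example that does not come from the question.

```python
import re

formula = 'H2O'  # assumed example input, not from the question

# '.' takes any character, then '[^A-Z]*' takes everything up to the next capital
loose = re.findall(r'.[^A-Z]*', formula)
# '[A-Z][a-z]*' only takes letters, so the digit is dropped
strict = re.findall(r'[A-Z][a-z]*', formula)

print(loose)   # ['H2', 'O']
print(strict)  # ['H', 'O']
```

For formulas made only of letters, such as 'OCaOSeOO', both patterns give the same list.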
Yefet

You can use re.split with lookarounds:

import re
structuralFormula1 = 'OCaOSeOO'
print(re.split(r'(?<!^)(?=[A-Z])', structuralFormula1))

It splits at every position that is:

  • Not preceded by the beginning of the string: (?<!^) => this avoids a leading empty match
  • Followed by an uppercase letter: (?=[A-Z])

Output:

['O', 'Ca', 'O', 'Se', 'O', 'O']
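
To see what the (?<!^) lookbehind is doing, here is a small comparison, offered as a sketch. It assumes Python 3.7 or later, where re.split can split on empty matches; the variable names are made up for illustration.

```python
import re

formula = 'OCaOSeOO'

# Without the lookbehind, the split also happens at position 0,
# which leaves an empty string at the front of the result.
without_guard = re.split(r'(?=[A-Z])', formula)
# With (?<!^), the position at the start of the string is skipped.
with_guard = re.split(r'(?<!^)(?=[A-Z])', formula)

print(without_guard)  # ['', 'O', 'Ca', 'O', 'Se', 'O', 'O']
print(with_guard)     # ['O', 'Ca', 'O', 'Se', 'O', 'O']
```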
Tranbi