
I'm trying to split a molecular formula, given as a string, into its individual atom components. Each atom starts at a capital letter and ends at the last digit that follows it.

For example, 'SO4' would become ['S', 'O4'].

And 'C6H12O6' would become ['C6', 'H12', 'O6'].

I'm pretty sure I need to use the re (regex) module. This answer is close to what I'm looking for: Split a string at uppercase letters

Johnny Metz

1 Answer


Use re.findall() with the pattern:

[A-Z][a-z]?\d*
  • [A-Z] matches any uppercase character

  • [a-z]? matches zero or one lowercase character

  • \d* matches zero or more digits

Based on your examples this should work, although it is worth checking whether a dedicated chemistry library already handles formula parsing.

Example:

>>> import re
>>> re.findall(r'[A-Z][a-z]?\d*', 'C6H12O6')
['C6', 'H12', 'O6']

>>> re.findall(r'[A-Z][a-z]?\d*', 'SO4')
['S', 'O4']

>>> re.findall(r'[A-Z][a-z]?\d*', 'HCl')
['H', 'Cl']
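
For repeated use, the same pattern can be wrapped in a small function. This is only a minimal sketch: the names `ATOM_PATTERN` and `split_formula` are hypothetical, and it assumes flat formulas with no parentheses, charges, or hydrate dots.

```python
import re

# Hypothetical names; the pattern itself is the one shown above.
ATOM_PATTERN = re.compile(r'[A-Z][a-z]?\d*')

def split_formula(formula):
    """Return the element-plus-count tokens found in a flat formula string."""
    return ATOM_PATTERN.findall(formula)

print(split_formula('C6H12O6'))  # ['C6', 'H12', 'O6']
print(split_formula('Ca(OH)2'))  # ['Ca', 'O', 'H'] (the group multiplier is lost)
```

The second call shows the limit of this approach: characters the pattern does not match, such as parentheses and a multiplier that follows them, are silently skipped, so grouped formulas would need a real parser.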
heemayl
  • Perfect answer, thanks! What about breaking up the individual components into their atom and number? So ['H12'] would become ['H', 12] and ['Ca2'] would become ['Ca', 2]. – Johnny Metz Oct 07 '16 at 23:36
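
One possible way to get the symbol and the number separately is to put each part of the pattern in its own capture group, so that `re.findall()` returns `(symbol, digits)` tuples. The sketch below is not from the answer above: the function name `parse_formula` is hypothetical, and treating a missing number as a count of 1 is an assumption.

```python
import re

def parse_formula(formula):
    """Split a flat formula into (symbol, count) pairs."""
    pairs = []
    # With two capture groups, findall yields (symbol, digits) tuples.
    for symbol, digits in re.findall(r'([A-Z][a-z]?)(\d*)', formula):
        # Assumption: an element with no trailing digits has a count of 1.
        count = int(digits) if digits else 1
        pairs.append((symbol, count))
    return pairs

print(parse_formula('C6H12O6'))  # [('C', 6), ('H', 12), ('O', 6)]
print(parse_formula('CaCl2'))    # [('Ca', 1), ('Cl', 2)]
```

Tuples are used here instead of the two-element lists in the comment; swapping `(symbol, count)` for `[symbol, count]` gives the list form.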