I have a list of thousands of chemical formulas that could include the symbol of any element. I would like to determine the total number of atoms, summed over all elements, in each formula. Examples include:
- CH3NO3
- CSe2
- C2Cl2
- C2Cl2O2
- C2Cl3F
- C2H2BrF3
- C2H2Br2
- C2H3Cl3Si
I want the total number of atoms in a single formula, so for the first example (CH3NO3), the answer would be 8 (1 carbon + 3 hydrogens + 1 nitrogen + 3 oxygens).
I found code by PEH (Extract numbers from chemical formula) that uses a regular expression to count the atoms of one specific element in a chemical formula.
Could this be adapted to give the total atoms?
' Requires a reference to "Microsoft VBScript Regular Expressions 5.5"
Public Function ChemRegex(ChemFormula As String, Element As String) As Long
    Dim regEx As New RegExp
    With regEx
        .Global = True
        .MultiLine = True
        .IgnoreCase = False
    End With

    'first pattern matches every element once
    regEx.Pattern = "([A][cglmrstu]|[B][aehikr]?|[C][adeflmnorsu]?|[D][bsy]|[E][rsu]|[F][elmr]?|[G][ade]|[H][efgos]?|[I][nr]?|[K][r]?|[L][airuv]|[M][cdgnot]|[N][abdehiop]?|[O][gs]?|[P][abdmortu]?|[R][abefghnu]|[S][bcegimnr]?|[T][abcehilms]|[U]|[V]|[W]|[X][e]|[Y][b]?|[Z][nr])([0-9]*)"

    Dim Matches As MatchCollection
    Set Matches = regEx.Execute(ChemFormula)

    Dim m As Match
    For Each m In Matches
        If m.SubMatches(0) = Element Then
            'a symbol without a trailing number counts as 1
            ChemRegex = ChemRegex + IIf(Not m.SubMatches(1) = vbNullString, m.SubMatches(1), 1)
        End If
    Next m

    'second pattern finds parentheses and multiplies the elements within
    '([0-9]+) captures the whole multiplier; ([0-9])+ would keep only its last digit
    regEx.Pattern = "(\((.+?)\)([0-9]+))"
    Set Matches = regEx.Execute(ChemFormula)
    For Each m In Matches
        '-1 because all elements were already counted once in the first pattern
        ChemRegex = ChemRegex + ChemRegex(m.SubMatches(1), Element) * (m.SubMatches(2) - 1)
    Next m
End Function
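
One possible direction, as an untested sketch rather than a confirmed solution: drop the `Element` test so that every matched symbol contributes its count. The function name `ChemTotalAtoms` is hypothetical. For brevity the sketch swaps PEH's explicit element list for a generic `[A-Z][a-z]?` pattern, so it assumes every formula is well formed and does not validate symbols (PEH's pattern could be dropped in unchanged). It also assumes parentheses are not nested.

```vba
' Hypothetical adaptation of ChemRegex: total atoms across all elements.
' Requires a reference to "Microsoft VBScript Regular Expressions 5.5".
' Assumptions: well-formed formulas, no nested parentheses.
Public Function ChemTotalAtoms(ChemFormula As String) As Long
    Dim regEx As New RegExp
    With regEx
        .Global = True
        .MultiLine = True
        .IgnoreCase = False
    End With

    ' Generic symbol pattern (an assumption, not PEH's list):
    ' one capital letter, an optional lowercase letter, optional digits
    regEx.Pattern = "([A-Z][a-z]?)([0-9]*)"

    Dim Matches As MatchCollection
    Set Matches = regEx.Execute(ChemFormula)

    Dim m As Match
    For Each m In Matches
        ' No element filter: every symbol adds its count (1 if no number follows)
        If m.SubMatches(1) = vbNullString Then
            ChemTotalAtoms = ChemTotalAtoms + 1
        Else
            ChemTotalAtoms = ChemTotalAtoms + CLng(m.SubMatches(1))
        End If
    Next m

    ' Parenthesised groups: add the group's atoms (multiplier - 1) more times,
    ' because the first pass above already counted them once
    regEx.Pattern = "\(([^()]+)\)([0-9]+)"
    Set Matches = regEx.Execute(ChemFormula)
    For Each m In Matches
        ChemTotalAtoms = ChemTotalAtoms + _
            ChemTotalAtoms(CStr(m.SubMatches(0))) * (CLng(m.SubMatches(1)) - 1)
    Next m
End Function
```

Placed in a standard module, this could be called from a worksheet as `=ChemTotalAtoms(A1)`; for CH3NO3 the first pass adds 1 + 3 + 1 + 3, which gives the expected 8.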