2

I have a list of thousands of chemical formulas that could include symbols for any element. I would like to determine the total number of atoms of any element in each formula. Examples include:

  • CH3NO3
  • CSe2
  • C2Cl2
  • C2Cl2O2
  • C2Cl3F
  • C2H2BrF3
  • C2H2Br2
  • C2H3Cl3Si

I want the total number of atoms in a single formula, so for the first example (CH3NO3), the answer would be 8 (1 carbon + 3 hydrogens + 1 nitrogen + 3 oxygens).

I found code by PEH (Extract numbers from chemical formula) that uses regular expression to extract the number of instances of a specific element in a chemical formula.

Could this be adapted to give the total atoms?

Public Function ChemRegex(ChemFormula As String, Element As String) As Long
    Dim regEx As New RegExp
    With regEx
        .Global = True
        .MultiLine = True
        .IgnoreCase = False
    End With
    
    'first pattern matches every element once
    regEx.Pattern = "([A][cglmrstu]|[B][aehikr]?|[C][adeflmnorsu]?|[D][bsy]|[E][rsu]|[F][elmr]?|[G][ade]|[H][efgos]?|[I][nr]?|[K][r]?|[L][airuv]|[M][cdgnot]|[N][abdehiop]?|[O][gs]?|[P][abdmortu]?|[R][abefghnu]|[S][bcegimnr]?|[T][abcehilms]|[U]|[V]|[W]|[X][e]|[Y][b]?|[Z][nr])([0-9]*)"
    
    Dim Matches As MatchCollection
    Set Matches = regEx.Execute(ChemFormula)
    
    Dim m As Match
    For Each m In Matches
        If m.SubMatches(0) = Element Then
            ChemRegex = ChemRegex + IIf(Not m.SubMatches(1) = vbNullString, m.SubMatches(1), 1)
        End If
    Next m
    
    'second patternd finds parenthesis and multiplies elements within
    regEx.Pattern = "(\((.+?)\)([0-9])+)+?"
    Set Matches = regEx.Execute(ChemFormula)
    For Each m In Matches
        ChemRegex = ChemRegex + ChemRegex(m.SubMatches(1), Element) * (m.SubMatches(2) - 1) '-1 because all elements were already counted once in the first pattern
    Next m
End Function
Community
  • 1
  • 1
Daniel
  • 57
  • 9
  • Please edit the question to limit it to a specific problem with enough detail to identify an adequate answer. – Community Oct 20 '21 at 08:48
  • 2
    You could do that by looping through all characters. Count all capital characters and add all numbers (subtracted by 1). That is the total count of elements. – Pᴇʜ Oct 20 '21 at 08:48
  • Try comment out `If m.SubMatches(0) = Element Then` and it's `End If`. You can also remove `Element As String` from the argument since you are not filtering by element. Edit: `ChemRegex = ChemRegex + ChemRegex(m.SubMatches(1), Element) * (m.SubMatches(2) - 1)` change to `ChemRegex = ChemRegex + ChemRegex(m.SubMatches(1)) * (m.SubMatches(2) - 1)` as well – Raymond Wu Oct 20 '21 at 08:57
  • Judging by your examples, you're not concerned with polymers, e.g. CH3(CH2)50CH3, where the 50 represents the number of repeating units (of CH2). – Jos Woolley Oct 20 '21 at 08:57

2 Answers2

6

You could do that by looping through all characters. Count all capital characters and add all numbers subtracted by 1. That is the total count of elements.

Option Explicit

Public Function ChemCountTotalElements(ByVal ChemFormula As String) As Long
    Dim RetVal As Long

    Dim c As Long
    For c = 1 To Len(ChemFormula)
        Dim Char As String
        Char = Mid$(ChemFormula, c, 1)
        
        If IsNumeric(Char) Then
            RetVal = RetVal + CLng(Char) - 1
        ElseIf Char = UCase(Char) Then
            RetVal = RetVal + 1
        End If
        
    Next c
    
    ChemCountTotalElements = RetVal
End Function

Note that this does not handle parenthesis! And it does not check if the element actually exists. So XYZ2 will be counted as 4.

Also this only can handle numbers below 10. In case you have numbers with 10 and above use the RegEx solution below (which can handle that).

enter image description here

Recognize also chemical formulas with prenthesis like Ca(OH)₂

If you need a more precise way (checking the existance of the Elements) and recognizing parenthesis you need to do it with RegEx again.

Because VBA doesn't support regular expressions out of the box we need to reference a Windows library first.

  1. Add reference to regex under Tools then References
    enter image description here

  2. and selecting Microsoft VBScript Regular Expression 5.5
    enter image description here

  3. Add this function to a module

    Public Function ChemRegexCountTotalElements(ByVal ChemFormula As String) As Long
        Dim RetVal As Long
    
        Dim regEx As New RegExp
        With regEx
            .Global = True
            .MultiLine = True
            .IgnoreCase = False
        End With
    
        'first pattern matches every element once
        regEx.Pattern = "([A][cglmrstu]|[B][aehikr]?|[C][adeflmnorsu]?|[D][bsy]|[E][rsu]|[F][elmr]?|[G][ade]|[H][efgos]?|[I][nr]?|[K][r]?|[L][airuv]|[M][cdgnot]|[N][abdehiop]?|[O][gs]?|[P][abdmortu]?|[R][abefghnu]|[S][bcegimnr]?|[T][abcehilms]|[U]|[V]|[W]|[X][e]|[Y][b]?|[Z][nr])([0-9]*)"
    
        Dim Matches As MatchCollection
        Set Matches = regEx.Execute(ChemFormula)
    
        Dim m As Match
        For Each m In Matches
            RetVal = RetVal + IIf(Not m.SubMatches(1) = vbNullString, m.SubMatches(1), 1)
        Next m
    
        'second patternd finds parenthesis and multiplies elements within
        regEx.Pattern = "(\((.+?)\)([0-9]+)+)+?"
        Set Matches = regEx.Execute(ChemFormula)
        For Each m In Matches
            RetVal = RetVal + ChemRegexCountTotalElements(m.SubMatches(1)) * (m.SubMatches(2) - 1) '-1 because all elements were already counted once in the first pattern
        Next m
    
        ChemRegexCountTotalElements = RetVal
    End Function
    

While this code will also recognize parenthesis, note that it does not recognize nested parenthesis.

enter image description here

Pᴇʜ
  • 56,719
  • 10
  • 49
  • 73
  • 2
    Nice, but what if atoms are 10+? I'm no chemist by any means, so I'm not even sure if that's possible. – JvdV Oct 20 '21 at 09:14
  • 1
    @JvdV right, there was a mistake in the second pattern `(\((.+?)\)([0-9])+)+?` that has to be `(\((.+?)\)([0-9]+)+)+?` to recognise `10+` numbers correctly. The first solution does not support `10+` numbes (use regex version then). – Pᴇʜ Oct 20 '21 at 09:21
  • 1
    Both versions work. I tweaked the RegEx version to also recognise deuterium atoms by updating ```regEx.Pattern``` to include ```[D][bsy]?``` – I forgot to include an example chemical formula with deuterium in the original question, my bad. – Daniel Oct 20 '21 at 09:30
  • 1
    @Daniel yes but the first one has some limitations, while the RegEx version is clearly more powerful. – Pᴇʜ Oct 20 '21 at 09:32
  • @PEH I would like to use ChemRegex as a subroutine but I cannot make the part finds parenthesis and multiplies elements within work, see: https://stackoverflow.com/questions/70302851/how-to-speed-up-extracting-numbers-from-chemical-formula any thoughts? Thanks, Daniel – Daniel Dec 11 '21 at 12:56
  • @Daniel Can you open up a new question for that? Give example input data and expected output data along with a description of what exactly is going wrong. That would help a lot to find an answer. – Pᴇʜ Dec 11 '21 at 15:01
  • @PEH here is the new question: https://stackoverflow.com/q/70324430/17194644 – Daniel Dec 12 '21 at 14:28
1

Here's my two cent's

enter image description here

Formula in C1:

=ChemRegex(A1)

Where ChemRegex() calls:

Public Function ChemRegex(ChemFormula As String) As Long

With CreateObject("vbscript.regexp")
    .Global = True
    .Pattern = "[A-Z][a-z]*(\d*)"
    If .Test(ChemFormula) Then
        Set matches = .Execute(ChemFormula)
        For Each Match In matches
            ChemRegex = ChemRegex + IIf(Match.Submatches(0) = "", 1, Match.Submatches(0))
        Next
    Else
        ChemRegex = 0
    End If
End With

End Function

Or in a (shorter) 2-step regex-solution:

Public Function ChemRegex(ChemFormula As String) As Long

With CreateObject("vbscript.regexp")
    .Global = True
    .Pattern = "([A-Za-z])(?=[A-Z]|$)"
    ChemFormula = .Replace(ChemFormula, "$1-1")
    .Pattern = "\D+"
    ChemFormula = .Replace(ChemFormula, "+")
    ChemRegex = Evaluate(ChemFormula)
End With

End Function
JvdV
  • 70,606
  • 8
  • 39
  • 70