3

say I have a molecular formula like "C5Cl2NO2S" for which I'd like to calculate in R the molecular mass. I though the easiest would be to use a regular expression, to analyze and split the formula into it's elemental components and hand those over to a separate function which performs the calculation. However, I'm facing the problem that, when I hand over the backreferences of my RegEx, these are not evaluated but handed over as "\\1", "\\2".

This is my attempt:

masses <- list(
  C  = 12,
  H  = 1.01,
  Cl = 34.97,
  N  = 14.00,
  O  = 15.99,
  P  = 30.97,
  S  = 31.97
)

elementMass <- function( element, count ) {
  if( count == "" ) {
    count <- "1"
  }
  return( as.character( masses[[ element ]] * as.numeric( count ) ) )
}


sumFormula2Mass <- function( x ){
  y <- 0.0
  for( e in x ) {
    if( e != "" ) {
      y <- y + as.numeric( sub( "^(C|H|Cl|N|O|P|S)([0-9]*)$", elementMass("\\1", "\\2"), e ) )
    }
  }
  return( y )
}

sub(
  "^(C[0-9]*)?(H[0-9]*)?(Cl[0-9]*)?(N[0-9]*)?(O[0-9]*)?(P[0-9]*)?(S[0-9]*)?$",
  sumFormula2Mass( c("\\1", "\\2", "\\3", "\\4", "\\5", "\\6", "\\7") ),
  "C5Cl2NO2S"
)

Any ideas how to improve this? Many thanks

Beasterfield
  • 7,023
  • 2
  • 38
  • 47

4 Answers4

5

Below we assumed the form of formula in the question, i.e. a string of components each of which is an upper case letter followed optionally by lower case letters followed optionally by digits. We use gsubfn in the gsubfn package. It is like gsub except the replacement string can be various other objects. Here its a proto object. A proto object is an environment and here its used for containing a property, sum, and two methods, pre and fun. At the beginning pre is automatically run and has the effect of initializing sum. Then each time the regular expression matches, the proto object and the two referenced strings are passed to fun and fun is run to process them. At the end p$sum contains the result. The variable masses is defined in the question.

library(gsubfn)
p <- proto(pre = function(this) this$sum <- 0,
    fun = function(this, name, count) {
        count <- as.numeric(count)
        if (is.na(count)) count <- 1
        this$sum <- this$sum + masses[[name]] * count
        ""
    })
gsubfn("([[:upper:]][[:lower:]]*)(\\d*)", p, "C5Cl2NO2S")
p$sum # 207.89
G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341
3

I don't think that backreferences in sub() work like this. You seem to be treating them as though they are returned values, when they are inputs.

Here is a different solution. It takes a very different approach, namely splitting the string up into individual parts and then referring to these. However, it has some limitations. Firstly, it assumes cannot handle chemical formulae with parentheses. Secondly, it assumes that atoms are written sensibly (i.e. chlorine is written as Cl - upper case C and lower case l). There are probably lots of other limitations, but this should give you an idea about how this solution might look.

sumFormula2Mass2 <- function(x,masses){
  summedMasses <- NULL
  for(e in x){
    ## split up the string
    split.e <- unlist(strsplit(e,''))
    ## join letters from individual elements (since subequent letters should be lower case)
    ilower <- grep('[a-z]',split.e)
    if(length(ilower) > 0){
      for(i in 1:length(ilower)){
        j <- ilower[i]
        split.e <- c(if(j > 2) split.e[1:(j-2)],
                     paste(split.e[(j-1):j],collapse=''),
                     if(j < length(split.e)) split.e[(j+1):length(split.e)])
        ilower <- ilower - 1
      }
    }
    ## join numbers together (in case there are more than 10 atoms)
    inum <- grep('[0-9]',split.e)
    if(length(inum) > 1){
      for(i in 1:(length(inum)-1)){ 
        if(inum[i + 1] == inum[i] + 1){
          j <- inum[i]
          split.e <- c(split.e[1:(j-1)],
                       paste(split.e[j:(j+1)],collapse=''),
                       if(j+2 <= length(split.e)) split.e[(j+2):length(split.e)])
          inum <- inum - 1
        }
      }
    }
    ## add up the mass
    sumMass = 0
    for(i in 1:length(split.e)){
      if(length(grep('[1-9]',split.e[i])) > 0){
        next 
      } else if(split.e[i] %in% names(masses)){
        nMolecules <- 1
        if(i != length(split.e) && length(grep('[1-9]',split.e[i+1])) > 0)
          nMolecules <- as.numeric(split.e[i+1])
        sumMass <- sumMass + nMolecules * masses[[split.e[i]]]
      } else {
        warning(sprintf('Could not match element %s',split.e[i]))
        next
      }
    }
    summedMasses <- c(summedMasses,sumMass)
  }
  return(summedMasses)
}

Here are some results for your compound plus some made-up compounds (I am not a chemist):

> sumFormula2Mass2(c("C5Cl2NO2S","C5Cl2NO2S4","C5Cl10NO2S4"),masses)
[1] 207.89 303.80 583.56
nullglob
  • 6,903
  • 1
  • 29
  • 31
  • I was thinking to do the same as you +1 – Luciano Selzer Aug 10 '11 at 15:33
  • Thanks for your solution. Your limitations are completely OK for my case. Basically you brought it to the point: I need a function which gives me the matching groups as return values. I finally found my solution with `unlist( strsplit( sub( ..., replacement="\\1#\\2\\#3\\4# ... ", ... ), split="#" ) )`. Not really elegant, but efficient :-) – Beasterfield Aug 10 '11 at 15:43
0

Believe it or not, this appears to not be all that uncommon:

Parsing a chemical formula

How to check that a regular expression has matched a string completely, i.e. - the string did not contain any extra character?

http://www.sitepoint.com/forums/php-34/chemical-formula-regular-expressions-317012.html

I found the above by Googling molecular formula regex

Community
  • 1
  • 1
JD Long
  • 59,675
  • 58
  • 202
  • 294
  • Thanks for the links but unfortunately none of them have to do with my problem. The molecular formula is just the subject I am working on today, but my problem is the handling of RegEx' backreferences in R. – Beasterfield Aug 10 '11 at 15:29
  • woops. Sorry about that. I'll leave the question here as others may stumble on this question looking for those :) – JD Long Aug 10 '11 at 16:20
0

have a look at

RSiteSearch ("molecular weight")

I guess the 2nd or 3rd hit is what you are looking for (the first is for proteins).

(sorry, didn't see your comment on the other answer - however, I leave this in case someone is really looking for molecular weight calculation).

cbeleites unhappy with SX
  • 13,717
  • 5
  • 45
  • 57