Hi, I'm trying to extract information from chemical formulas and add it to an existing table in R. I have a column containing chemical formulas such as C4H8O2, and I have no problem extracting each element and its corresponding count. The problem arises when a formula contains brackets, as in C3[13]C1H8O2. I want the column title to be [13]C and the value to be 1. However, my code doesn't recognize '[13]C1', so it gives me an error.

Any suggestions would be great.

# First manipulation: extract information from the "Composition" column into separate columns for each element

data2 <- dataframe %>%
  mutate(Composition = gsub("\\b([A-Za-z]+)\\b", "\\11", Composition),
         name = str_extract_all(Composition, "[A-Za-z]+"),
         value = str_extract_all(Composition, "\\d+")) %>%
  unnest() %>%
  spread(name, value, fill = 0)

I already have a CSV file with the table organized, which I read into a data frame, so now I'm just trying to parse out the elements into a 'C' column and a '[13]C' column with their corresponding counts.

Vincent Zoonekynd
David
  • Welcome to Stack! It is best if you can include a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). Could you do ```dput(head(dataframe))```(assuming ```dataframe``` is the actual name of your dataframe) and include that in your question above and that will help others test out their answers and show you the results. – Russ Thomas Jul 12 '19 at 00:50

1 Answer

The following regular expression should extract the isotope number, the element, and the number of atoms.

library(stringr)
str_match_all( "C3[13]C1H8O2", "(\\[[0-9]+\\])?([A-Za-z]+)([0-9]+)" )
## [[1]]
##      [,1]     [,2]   [,3] [,4]
## [1,] "C3"     NA     "C"  "3" 
## [2,] "[13]C1" "[13]" "C"  "1" 
## [3,] "H8"     NA     "H"  "8" 
## [4,] "O2"     NA     "O"  "2" 
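
To make the role of each capture group concrete, here is a minimal sketch that turns the match matrix for a single formula into a named vector of counts. The variable names (`m`, `isotope`, `counts`) are only illustrative and are not part of the original answer.

```r
library(stringr)

pattern <- "(\\[[0-9]+\\])?([A-Za-z]+)([0-9]+)"

# str_match_all() returns a list with one character matrix per input string
m <- str_match_all("C3[13]C1H8O2", pattern)[[1]]

# Column 2: optional isotope label (NA when absent)
# Column 3: element symbol
# Column 4: number of atoms
isotope <- ifelse(is.na(m[, 2]), "", m[, 2])
counts <- setNames(as.integer(m[, 4]), paste0(isotope, m[, 3]))
counts
```

Note that the pattern expects every element to be followed by an explicit count, as in the formulas from the question.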

With a data.frame:

library(tidyr)
library(dplyr)
d <- data.frame( Composition = c( "H2O1", "C3[13]C1H8O2" ) )
pattern <- "(\\[[0-9]+\\])?([A-Za-z]+)([0-9]+)"
d %>%
  mutate( Details = lapply( str_match_all( Composition, pattern ), as.data.frame ) ) %>%
  unnest(Details) %>%
  transmute(
    Composition,
    element = paste0( ifelse(is.na(V2),"",V2), V3 ),
    number = V4
  ) %>% 
  spread(key="element", value="number") %>%
  replace(., is.na(.), 0)

##    Composition [13]C C H O
## 1 C3[13]C1H8O2     1 3 8 2
## 2         H2O1     0 0 2 1
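
For readers on a newer tidyr, the same idea can be sketched with `pivot_wider()`, which supersedes `spread()`. This assumes tidyr 1.0.0 or later; the result name `wide` is hypothetical, and the counts are converted to integers here, which the code above does not do. The result has one row per formula and one column per element or isotope.

```r
library(dplyr)
library(tidyr)
library(stringr)

d <- data.frame(Composition = c("H2O1", "C3[13]C1H8O2"),
                stringsAsFactors = FALSE)
pattern <- "(\\[[0-9]+\\])?([A-Za-z]+)([0-9]+)"

wide <- d %>%
  # One data frame of matches (columns V1..V4) per formula
  mutate(Details = lapply(str_match_all(Composition, pattern), as.data.frame)) %>%
  unnest(Details) %>%
  transmute(
    Composition,
    element = paste0(ifelse(is.na(V2), "", V2), V3),  # e.g. "[13]C" or "C"
    number = as.integer(V4)
  ) %>%
  # Elements missing from a formula are filled with 0
  pivot_wider(names_from = element, values_from = number,
              values_fill = list(number = 0L))

wide
```

Unlike `spread()`, `pivot_wider()` keeps columns in order of first appearance rather than sorting them, so the column order may differ from the output shown above.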
Vincent Zoonekynd
  • Hi, thank you for getting back so quickly! I was actually looking to have one row per formula giving the following information in columns: "Composition", "C", "[13]C", "H", "O". Is this possible to do? This is my first time using R, so I was thrown into the loop without any training. – David Jul 12 '19 at 07:29