Assign numbers to each letter so that r calculates the sum of the letters in a word

Question

I'm trying to create a tool in R that will calculate the atomic composition (i.e. number of carbon, hydrogen, nitrogen and oxygen atoms) of a peptide chain that is input in single letter amino acid code. For example, the peptide KGHLY consists of the amino acids lysine (K), glycine (G), histidine (H), leucine (L) and tyrosine (Y). Lysine is made of 6 carbon, 13 hydrogen, 1 nitrogen and 2 oxygen. Glycine is made of 2 carbon, 5 hydrogen, 1 nitrogen and 2 oxygen. etc. etc. I would like the r code to either read the peptide string (KGHLY) from a data frame or take input from the keyboard using readline() I am new to R and new to programming. I am able to make objects for each amino acid, e.g. G <- c(2, 5, 1, 2) or build a data frame containing all 20 amino acids and their respective atomic compositions. The bit that I am struggling with is that I don't know how to get R to index from a data frame in response to a string of letters. I have a feeling the solution is probably very simple but so far I have not been able to find a function that is suited to this task.

Split the string (use `strplit`) then use `match`. If you want code in an answer, you'll need to make a small, reproducible, illustrative example. [See tips on that here](http://stackoverflow.com/q/5963269/903061). People will be very happy if you use simulated data or use `dput()` to share data. — Gregor Thomas, Apr 11 '16 at 21:16
what would be helpful in this case is a sample of how your input is structured (maybe a vector of peptides and a "cost" matrix" - a matrix of amino acids with the quantity of each element); this is more or less what you describe in your question. `dput` this data in your question, and I think we could help you with some code — Chris, Apr 11 '16 at 21:24

Lars Arne Jordanger · Accepted Answer · 2016-04-12T19:45:41.133

There's two main components to take care of here: The selection of a method for the storing of the basic data and the algorithm that computes the result you desire.

For the computation, it might be preferable to have your data stored in a matrix, due to the way R recycles the shorter vector when multiplying two vectors. This recycling also kicks in if you want to multiply a matrix with a vector, since a matrix is a vector with some additional attributes (that is to say, dimension and dimension-names). Consider the example below to see how it works

test_matrix <- matrix(data = 1:12, nrow = 3)
test_vec <- c(3, 0, 1)

test_matrix
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12

test_matrix * test_vec
     [,1] [,2] [,3] [,4]
[1,]    3   12   21   30
[2,]    0    0    0    0
[3,]    3    6    9   12

Based on this observation, it's possible to deduce that a solution where each amino acid has one row in a matrix might be a good way to store the look up data; when we have a counting vector with specifying the desired amount of contribution from each row, it will be sufficient to multiply our matrix with our counting vector, and then sum the columns - the last part solved using colSums.

colSums(test_matrix * test_vec)
[1]  6 18 30 42

It's in general a "pain" to store this kind of information in a matrix, since it might be a "lot of work" to update the information later on. However, I guess it's not that often it's required to add new amino acids, so that might not be an issue in this case.

So let's create a matrix for the the five amino acids needed for the peptide you mentioned in your example. The numbers was found on Wikipedia, and hopefully I didn't mess up when I copied them. Just follow suit to add all the other amino acids too.

amino_acids <- rbind(
    G = c(C = 2, H = 5,  N = 1, O = 2),
    L = c(C = 6, H = 13, N = 1, O = 2),
    H = c(C = 6, H = 9,  N = 3, O = 2),
    K = c(C = 6, H = 14, N = 2, O = 2),
    Y = c(C = 9, H = 11, N = 1, O = 3))

amino_acids
  C  H N O
G 2  5 1 2
L 6 13 1 2
H 6  9 3 2
K 6 14 2 2
Y 9 11 1 3

This matrix contains the information we want, but it might be preferable to have them in lexicographic order - and it would be nice to ensure that we haven't by mistake added the same row twice. The code below takes care of both of these issues.

amino_acids <-
    amino_acids[sort(unique(rownames(amino_acids))), ]

amino_acids                   
  C  H N O
G 2  5 1 2
H 6  9 3 2
K 6 14 2 2
L 6 13 1 2
Y 9 11 1 3

The next part is to figure out how to deal with the peptides. This will here be done by first using strsplit to split the string into separate characters, and then use a table-solution upon the result to get the vector that we want to multiply with the matrix.

peptide <- "KGHLY"

peptide_2 <- unlist(strsplit(x = peptide, split = ""))
peptide_2
[1] "K" "G" "H" "L" "Y"

Using table upon peptide_2 gives us

table(peptide_2)
peptide_2
G H K L Y 
1 1 1 1 1

This can thus be used to define a vector to play the role of test_vec in the first example. However, in general the resulting vector will contain fewer components than the rows of the matrix amino_acids; so a restriction must be performed first, in order to get the correct format we want for our computation.

Several options is available, and the simplest one might be to use the names from the table to subset the required rows from amino_acids, such that the computation can proceed without any further fuzz.

peptide_vec <- table(peptide_2)

colSums(amino_acids[names(peptide_vec), ] * as.vector(peptide_vec))
 C  H  N  O 
29 52  8 11

This outlines one possible solution for the core of your problem, and this can be collected into a function that takes care of all the steps for us.

peptide_function <- function(peptide, amino_acids) {
    peptide_vec <- table(
        unlist(strsplit(x = peptide, split = "")))
    ## Compute the result and return it to the work flow.
    colSums(
        amino_acids[names(peptide_vec), ] *
        as.vector(peptide_vec))
}

And finally a test to see that we get the same answer as before.

peptide_function(peptide = "GHKLY",
                 amino_acids = amino_acids)
 C  H  N  O 
29 52  8 11

What next? Well that depends on how you have stored your peptides, and what you would like to do with the result. If for example you have the peptides stored in a vector, and would like to have the result stored in a matrix, then it might e.g. be possible to use vapply as given below.

data_vector <- c("GHKLY", "GGLY", "HKLGL")

result <- t(vapply(
    X = data_vector,
    FUN = peptide_function,
    FUN.VALUE = numeric(4),
    amino_acids = amino_acids))

result
       C  H N  O
GHKLY 29 52 8 11
GGLY  19 34 4  9
HKLGL 26 54 8 10

Wow, many many thanks for this extensive answer. I have worked through your example and expanded the number of amino acids and elements to give: data_vector <- c("CPYEEHIK", "FKDLGEQHFK", "FPNAEFAEITK", "LGEYGFQNAVLVR", "DVFLGTFLYEYSR") result <- vapply(data_vector, FUN = atomComp, FUN.VALUE = numeric(6)) # C H N O S D # CPYEEHIK 45 63 11 12 1 16.95 etc... — Jatin, Apr 13 '16 at 16:53

Assign numbers to each letter so that r calculates the sum of the letters in a word

1 Answers1