Comparing two columns "by the sequence" and making new column

Question

The problem is hard to explain, so let me describe what I want to get from this data. I have a data set with about 20 columns; two of them are shown here.

Sequence             modifications
AAAAGAAAVANQGKK     [14] Acetyl (K)|[15] Acetyl (K)
AAAAGAAAVANQGKK     [14] Acetyl (K)|[15] Acetyl (K)
AAIKFIKFINPKINDGE   [4] Acetyl (K)|[7] Acetyl (K)|[12] Acetyl (K)
AAIKFIKFINPKINDGE   [4] Acetyl (K)|[7] Acetyl (K)|[12] Acetyl (K)
AAIKFIKFINPKINDGE   [7] Acetyl (K)|[12] Acetyl (K)
AAIKFIKFINPKINDGE   [4] Acetyl (K)|[7] Acetyl (K)
AAIYKLLKSHFRNE      [5] Biotin (K)|[8] Acetyl (K)
AAKKFEE             [3] Acetyl (K)|[4] Acetyl (K)

As you can see, the same sequence can carry different modifications. Sometimes there are 3 Acetyl groups, sometimes 2, sometimes only one, and in other cases there is no modification at all. I am only interested in two modifications, Biotin and Acetyl; the others are not important. The number of modifications depends on the number of "K" in the sequence. For example, if there are 3 "K" in the sequence, the number of possible modifications is 0, 1, 2 or 3, and never more than 3. So I would like to group those sequences (1000 rows) by the number of "K" in the sequence and by the number and type of the modifications, without disturbing the other columns.

What I want to get from this data in R is a set of groups of sequences, each with a specified modification pattern. For example:

First Group: (number of "K" in the sequence = 2, and both modified by acetyl)

Sequence             modifications
AAAAGAAAVANQGKK      [14] Acetyl (K)|[15] Acetyl (K)
AAIYKLLKSHFRNE       [5] Acetyl (K)|[8] Acetyl (K)

Second Group: (number of "K" in the sequence = 2, one modified by acetyl, the other unmodified)

Third Group: (number of "K" in the sequence = 3, two modified by acetyl and the last one by biotin)

I have to include all of the possibilities. That is what I think would work best for this part of the script I am trying to write. Maybe you have other suggestions on how to interpret this data.

The second problem: I calculated the mean of the values in 3 different columns and I would like to put the result in the same data, in another column. How can I do that?

tbl_imp$mean <- rowMeans(subset(tbl_imp, select = c("x", "y", "w")), na.rm = TRUE)
tbl_imp$mean <- data.frame(tbl_imp$mean)

This is the code I used to calculate the row means. I just don't know how to make a new column in my data and put the means there. Should I use the transform function (?transform)?

It is better to ask one problem per question. And it would be easier to answer with a sample of what you want to get in the first problem, and a sample of your data for the second. — juba, Oct 10 '13 at 11:02
Can you please [`dput` your sample data](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610#5963610) — Henrik, Oct 10 '13 at 11:14

score 0 · Accepted Answer · answered Oct 10 '13 at 12:43

Something like this might work for your first part. I'm unable to download the file right now but when I can, I will try and respond to the second part as well.

library(data.table)
library(stringr)

# Slightly modified dataset
dataset <- data.table(
Sequence  = c(
'AAAAGAAAVANQGKK'    
,'AAAAGAAAVANQGKK'    
,'AAIKFIKFINPKINDGE'  
,'AAIKFIKFINPKINDGE'  
,'AAIKFIKFINPKINDGE'  
,'AAIKFIKFINPKINDGE'
,'AAIYKLLKSHFRNE'
,'AAKKFEE'
),
 modifications = c(
'[14] Acetyl (K)|[15] Acetyl (K)'
,'[14] Acetyl (K)|[15] Acetyl (K)'
,'[4] Acetyl (K)|[7] Something (K)|[12] Acetyl (K)'
,'[4] Acetyl (K)|[7] Acetyl (K)|[12] Acetyl (K)'
,'[7] Acetyl (K)|[12] Acetyl (K)'
,'[4] Acetyl (K)|[7] Acetyl (K)'
,'[5] Biotin (K)|[8] Acetyl (K)'
,'[3] Acetyl (K)'
)
)

# get the 1st, 2nd, 3rd modifications in separate columns
dataset <- data.table(cbind(
   dataset,
   str_split_fixed(dataset[,modifications], pattern = "\\(K\\)",3)
))

dataset[,':='(
   V1 = as.character(V1),
   V2 = as.character(V2),
   V3 = as.character(V3)
)]

# Count of modifications per row (stored as NoOfKs; this is the number of modified Ks, not the number of Ks in Sequence)
dataset[, NoOfKs := 3]
dataset[V3 == "", NoOfKs := 2]
dataset[V2 == "", NoOfKs := 1]
dataset[V1 == "", NoOfKs := 0]

# Retaining Acetyl/Biotin or no modification only
dataset[, AB01 := TRUE]
dataset[, AB02 := TRUE]
dataset[, AB03 := TRUE]

dataset[V1 != "",  AB01 := grepl(V1, pattern = "Acetyl|Biotin")]
dataset[V2 != "",  AB02 := grepl(V2, pattern = "Acetyl|Biotin")]
dataset[V3 != "",  AB03 := grepl(V3, pattern = "Acetyl|Biotin")]

dataset <- dataset[AB01 & AB02 & AB03]

# Marking each modification as acetyl/biotin/none
dataset[V1 != "" & grepl(V1, pattern = "Acetyl"), AB1 := "Acetyl"]
dataset[V1 != "" & grepl(V1, pattern = "Biotin"), AB1 := "Biotin"]
dataset[V2 != "" & grepl(V2, pattern = "Acetyl"), AB2 := "Acetyl"]
dataset[V2 != "" & grepl(V2, pattern = "Biotin"), AB2 := "Biotin"]
dataset[V3 != "" & grepl(V3, pattern = "Acetyl"), AB3 := "Acetyl"]
dataset[V3 != "" & grepl(V3, pattern = "Biotin"), AB3 := "Biotin"]

dataset[
   ,
   list(
   Sequence = Sequence, 
   modifications = modifications, 
   GroupID = .GRP
   ),
   by = c('NoOfKs','AB1','AB2','AB3')
]

Output

   NoOfKs    AB1    AB2    AB3          Sequence                                 modifications GroupID
1:      2 Acetyl Acetyl     NA   AAAAGAAAVANQGKK               [14] Acetyl (K)|[15] Acetyl (K)       1
2:      2 Acetyl Acetyl     NA   AAAAGAAAVANQGKK               [14] Acetyl (K)|[15] Acetyl (K)       1
3:      2 Acetyl Acetyl     NA AAIKFIKFINPKINDGE                [7] Acetyl (K)|[12] Acetyl (K)       1
4:      2 Acetyl Acetyl     NA AAIKFIKFINPKINDGE                 [4] Acetyl (K)|[7] Acetyl (K)       1
5:      3 Acetyl Acetyl Acetyl AAIKFIKFINPKINDGE [4] Acetyl (K)|[7] Acetyl (K)|[12] Acetyl (K)       2
6:      2 Biotin Acetyl     NA    AAIYKLLKSHFRNE                 [5] Biotin (K)|[8] Acetyl (K)       3
7:      1 Acetyl     NA     NA           AAKKFEE                                [3] Acetyl (K)       4
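
For the second part of the question (row means in a new column), here is a minimal sketch. It assumes a data frame shaped like the asker's tbl_imp with numeric columns x, y and w; the id column and all values below are made up for illustration.

```r
# Hypothetical stand-in for tbl_imp; only the column names x, y, w come from the question.
tbl_imp <- data.frame(
  id = c("a", "b", "c"),
  x  = c(1, 4, NA),
  y  = c(2, 5, 8),
  w  = c(3, 6, 10)
)

# rowMeans() returns a plain numeric vector, so it can be assigned directly
# as a new column. Wrapping the result in data.frame() afterwards is not needed.
tbl_imp$mean <- rowMeans(tbl_imp[, c("x", "y", "w")], na.rm = TRUE)

tbl_imp
```

Assigning with `$<-` leaves the other columns untouched, so transform() should not be required for this.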

You did a pretty nice job. Let me try to analyse it. I need some time coz I use RStudio (much easier for me coz of the colours etc.) and the packages data.table and stringr are not available for RStudio yet. I'll try to install R and we will see. — Shaxi Liver, Oct 10 '13 at 13:42
Rstudio is a GUI wrapper for R. Any R package should work on Rstudio as well. What's the error message exactly? — TheComeOnMan, Oct 10 '13 at 13:57
Anyway, I already updated my Rstudio (there were some problems of course) and your script is really great. That shows everything I needed. I am not able to vote up coz my rep is only 15. — Shaxi Liver, Oct 11 '13 at 12:26
Anyway, the maximum number of "K" in this data is 6 and I tried to rewrite your script to 6 possible Ks and it doesn't work for me. Could you help me with that ? — Shaxi Liver, Oct 11 '13 at 12:56
The max(k) = 3 is hard-coded in a way, each block has 3n sub-steps if you noticed. You'll need to extend each block to 6n steps. — TheComeOnMan, Oct 15 '13 at 05:52
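
One possible way around the hard-coded limit, sketched in base R: split the modifications string on "|" so that any number of modifications is handled. This is not the answer's data.table code; the column names NoOfMods and ModKey are hypothetical, and the regular expression assumes every entry looks like "[position] Name (X)".

```r
# A few rows from the sample data used in this answer.
dataset <- data.frame(
  Sequence = c("AAAAGAAAVANQGKK", "AAIKFIKFINPKINDGE", "AAIYKLLKSHFRNE", "AAKKFEE"),
  modifications = c(
    "[14] Acetyl (K)|[15] Acetyl (K)",
    "[4] Acetyl (K)|[7] Something (K)|[12] Acetyl (K)",
    "[5] Biotin (K)|[8] Acetyl (K)",
    "[3] Acetyl (K)"
  ),
  stringsAsFactors = FALSE
)

# Split on the literal "|"; each row yields as many pieces as it has modifications.
mods <- strsplit(dataset$modifications, "|", fixed = TRUE)

# Strip the "[position] " prefix and the " (X)" suffix, leaving only the name.
mod_names <- lapply(mods, function(m) sub("^\\[[0-9]+\\] (.*) \\([A-Z]\\)$", "\\1", m))

dataset$NoOfMods <- lengths(mod_names)                                      # hypothetical column
dataset$ModKey   <- vapply(mod_names, paste, character(1), collapse = "|")  # hypothetical column

# Keep rows whose modifications are all Acetyl or Biotin.
keep <- vapply(mod_names, function(m) all(m %in% c("Acetyl", "Biotin")), logical(1))
dataset <- dataset[keep, ]

# One group id per (count, ordered names) combination.
dataset$GroupID <- as.integer(factor(paste(dataset$NoOfMods, dataset$ModKey)))
dataset
```

Because the key is built from however many pieces each row has, the same code should cover 6 Ks without adding more V and AB columns.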

alexis_laz · Answer 2 · 2013-10-11T13:15:00.710

I loaded your data as the object aa.

    mydata <-  data.frame(seqs = aa$Sequence, mods = aa$modifications) # subset of aa with sequences and modifications

    ##to find number of "K"s
    spl_seqs <- strsplit(as.character(mydata$seqs), split = "")  # split all sequences (use "as.character" because they are turned into factor)
    where_K <- lapply(spl_seqs, grep, pattern = "K") # find positions of "K"s in each sequence
    No_K <- sapply(where_K, length) # count "K"s in each sequence (sapply returns a plain vector rather than a list)

    mydata$No_Ks <- No_K #add a column that informs about the number of "K"s in each sequence
    ##
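
The same count can probably be obtained without splitting every sequence into single characters. A minimal sketch using base R pattern matching; the vector seqs is a made-up sample, with the last entry added to show the zero case:

```r
seqs <- c("AAAAGAAAVANQGKK", "AAIKFIKFINPKINDGE", "AAFTKLDQVWGSE", "AAAA")

# gregexpr() finds every match per string, regmatches() extracts the matched
# substrings (an empty vector when there is no match), and lengths() counts them.
No_K <- lengths(regmatches(seqs, gregexpr("K", seqs, fixed = TRUE)))
No_K
# 2 3 1 0
```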

I assume that all upper-case letters appearing in the "modifications" column refer either to the modification itself or to the "K". I can't think of another way to simplify the "modifications" column so that it can be manipulated, so I just keep the upper-case letters that are not "K":

    names(LETTERS) <- LETTERS  # DWin's idea in this http://stackoverflow.com/questions/4423460/is-there-a-function-to-find-all-lower-case-letters-in-a-character-vector 

    spl_mods <- strsplit(as.character(mydata$mods), split = "")  # split the characters in each modification row

Simplify the modifications column, keeping only the first letter of each modification:

    mods_ls <- vector("list", length = nrow(mydata))  #list to fill with simplified modifications
    for(i in seq_along(spl_mods))
     {
      res <- as.character(na.omit(LETTERS[spl_mods[[i]]])) #keep only upper-case letters (reuses spl_mods instead of re-splitting on every iteration)

      res <- as.character(na.omit(gsub("K", NA, res)))  # exclude "K"s 
      res <- as.character(na.omit(gsub("M", NA, res)))  # and "M"s I guessed

      mods_ls[[i]] <- res
     }
    mydata$simplified_mods <- unlist(lapply(mods_ls, paste, collapse = " ; "))

What we've got so far:

    mydata[1:10,]
    #                seqs                                          mods No_Ks simplified_mods
    #1    AAAAGAAAVANQGKK               [14] Acetyl (K)|[15] Acetyl (K)     2           A ; A
    #2    AAAAGAAAVANQGKK               [14] Acetyl (K)|[15] Acetyl (K)     2           A ; A
    #3      AAFTKLDQVWGSE                                [5] Acetyl (K)     1               A
    #4  AAIKFIKFINPKINDGE [4] Acetyl (K)|[7] Acetyl (K)|[12] Acetyl (K)     3       A ; A ; A
    #5  AAIKFIKFINPKINDGE [4] Acetyl (K)|[7] Acetyl (K)|[12] Acetyl (K)     3       A ; A ; A
    #6  AAIKFIKFINPKINDGE                [7] Acetyl (K)|[12] Acetyl (K)     3           A ; A
    #7  AAIKFIKFINPKINDGE                 [4] Acetyl (K)|[7] Acetyl (K)     3           A ; A
    #8     AAIYKLLKSHFRNE                 [5] Biotin (K)|[8] Acetyl (K)     2           B ; A
    #9            AAKKFEE                 [3] Acetyl (K)|[4] Acetyl (K)     2           A ; A
    #10           AAKYFRE                                [3] Acetyl (K)     1               A

Then you can subset the number of "K"s and the specific modifications you want. E.g.:

    how_many_K <- 2 
    what_mods <- "A ; A"    #separated by [space];[space]

    show_rows <- which(mydata$No_Ks == how_many_K & mydata$simplified_mods == what_mods)  
    mydata[show_rows,]
    #                             seqs                            mods No_Ks simplified_mods
    #1                 AAAAGAAAVANQGKK [14] Acetyl (K)|[15] Acetyl (K)     2           A ; A
    #2                 AAAAGAAAVANQGKK [14] Acetyl (K)|[15] Acetyl (K)     2           A ; A
    #9                         AAKKFEE   [3] Acetyl (K)|[4] Acetyl (K)     2           A ; A
    #11                     AANVKKTLVE   [5] Acetyl (K)|[6] Acetyl (K)     2           A ; A
    #14  AARDSKSPIILQTSNGGAAYFAGKGISNE  [6] Acetyl (K)|[24] Acetyl (K)     2           A ; A
    #20                        AEKLKAE   [3] Acetyl (K)|[5] Acetyl (K)     2           A ; A
    #21
    #....
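
To get every combination at once rather than one subset at a time, split() may be enough. A minimal sketch, assuming No_Ks is a plain numeric vector (for example after unlist()) and using a small made-up stand-in for mydata:

```r
# Hypothetical stand-in for mydata with the columns built above.
mydata <- data.frame(
  seqs = c("AAAAGAAAVANQGKK", "AAIKFIKFINPKINDGE", "AAIYKLLKSHFRNE", "AAKKFEE", "AAKYFRE"),
  No_Ks = c(2, 3, 2, 2, 1),
  simplified_mods = c("A ; A", "A ; A", "B ; A", "A ; A", "A"),
  stringsAsFactors = FALSE
)

# One data.frame per (number of Ks, modifications) combination that actually
# occurs; drop = TRUE discards combinations with no rows.
groups <- split(mydata, list(mydata$No_Ks, mydata$simplified_mods), drop = TRUE)

names(groups)        # names are built as "<No_Ks>.<simplified_mods>"
groups[["2.A ; A"]]  # sequences with 2 Ks, both acetylated
```

Each element keeps all columns of the original rows, so other columns in the real data would be carried along unchanged.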

EDIT: All this can be wrapped in a function like fun. x is your data.frame (the uploaded "for Henrik" data, as a structure). noK is the number of "K"s you want. no_modK is the number of modified "K"s you want. mod is the modifications you want, separated by [space];[space] (e.g. "B ; A ; O"):

    fun <- function(x, noK, no_modK = NULL, mod = NULL) #EDIT_1e: update arguments; made optional
    {
     mydata <- data.frame(seqs = x$Sequence, mods = x$modifications) 

     spl_seqs <- strsplit(as.character(mydata$seqs), split = "")  
     where_K <- lapply(spl_seqs, grep, pattern = "K") 
     No_K <- sapply(where_K, length) # plain vector rather than a list

     mydata$No_Ks <- No_K 

     names(LETTERS) <- LETTERS  

     spl_mods <- strsplit(as.character(mydata$mods), split = "")  

     mods_ls <- vector("list", length = nrow(mydata))  
     for(i in seq_along(spl_mods))
      {
       res <- as.character(na.omit(LETTERS[spl_mods[[i]]])) # reuses spl_mods instead of re-splitting on every iteration

       no_modedK <- length(grep("K", res))   #EDIT_1a: how many "K"s are modified?

       res <- as.character(na.omit(gsub("K", NA, res)))   
       res <- as.character(na.omit(gsub("M", NA, res)))  

       mods_ls[[i]] <- list(mods = res, modified_K = no_modedK) #EDIT_1b: catch number of "K"s modified (along with the actual modifications) 
      }

     mydata$no_modK <- unlist(lapply(lapply(lapply(mods_ls, `[`, 2), unlist), paste, collapse = " ; ")) #EDIT_1d: insert number of modified "K"s in "mydata"   
     mydata$simplified_mods <- unlist(lapply(lapply(lapply(mods_ls, `[`, 1), unlist), paste, collapse = " ; ")) #EDIT_1c: insert mods in "mydata"  

     if(!is.null(no_modK) && !is.null(mod)) #EDIT_1f: update "return"
      {
       show_rows <- which(mydata$No_Ks == noK & mydata$no_modK == no_modK & mydata$simplified_mods == mod) 
      }
     if(is.null(no_modK) && !is.null(mod))
      {
       show_rows <- which(mydata$No_Ks == noK & mydata$simplified_mods == mod) 
      } 
     if(is.null(mod) && !is.null(no_modK)) 
      {
       show_rows <- which(mydata$No_Ks == noK & mydata$no_modK == no_modK)
      }

     if(is.null(no_modK) && is.null(mod)) 
      {
       show_rows <- which(mydata$No_Ks == noK) 
      } 

     return(mydata[show_rows,])
    }

E.g.:

    fun(aa, noK = 3) #aa is the "for Henrik" data loaded in `R` (aa <- structure( ... ))
                                  seqs                                                             mods No_Ks no_modK simplified_mods
    4                AAIKFIKFINPKINDGE                    [4] Acetyl (K)|[7] Acetyl (K)|[12] Acetyl (K)     3       3       A ; A ; A
    5                AAIKFIKFINPKINDGE                    [4] Acetyl (K)|[7] Acetyl (K)|[12] Acetyl (K)     3       3       A ; A ; A
    6                AAIKFIKFINPKINDGE                                   [7] Acetyl (K)|[12] Acetyl (K)     3       2           A ; A
    #...
    fun(aa, noK = 3, no_modK = 2)
                             seqs                                             mods No_Ks no_modK simplified_mods
    6           AAIKFIKFINPKINDGE                   [7] Acetyl (K)|[12] Acetyl (K)     3       2           A ; A
    7           AAIKFIKFINPKINDGE                    [4] Acetyl (K)|[7] Acetyl (K)     3       2           A ; A
    #...

    fun(aa, noK = 2, mod = "A ; B")
                  seqs                           mods No_Ks no_modK simplified_mods
    200    ISAMVLTKMKE [8] Acetyl (K)|[10] Biotin (K)     2       2           A ; B
    441 NLKPSKPSYYLDPE  [3] Acetyl (K)|[6] Biotin (K)     2       2           A ; B
    #...

    fun(aa, noK = 2, no_modK = 1, mod =  "A")
                                     seqs            mods No_Ks no_modK simplified_mods
    15      AARDSKSPIILQTSNGGAAYFAGKGISNE [24] Acetyl (K)     2       1               A
    27                     AKALVAQGVKFIAE  [2] Acetyl (K)     2       1               A
    #...

EDIT_1: Updated fun and examples.

Guys, you did a pretty nice job here. Let me analyse it. I am so grateful for that coz you "wasted" some time to help a random guy. I didn't know that it might work like that! — Shaxi Liver, Oct 10 '13 at 13:40
I've got one more question for you, Alexis. Is it possible to check whether there are, say, 2 "K" and only one is modified? I can only set the number in "how_many_K", and it cannot be a vector like 1-3 or something. Sometimes there are 2 "K" and only one is modified by "biotin" or "acetyl". I hope you know what I mean. — Shaxi Liver, Oct 11 '13 at 10:37
I added an argument to `fun` so that data can be subsetted either by number of "K"s or by number of "K"s _and_ number of "K"s that are modified (not _which_ "K"s are modified, though). Is this what you're looking for? (E.g. `fun(x, noK = 2, no_Ks_modified = 1, modification = "whatever")`) — alexis_laz, Oct 11 '13 at 11:46
Updated `fun` and the examples in my answer. Added the `no_modK` argument to subset data with a specific number of modified "K"s (along with the number of total "K"s present; `noK` argument). — alexis_laz, Oct 11 '13 at 13:17
