2

I have two dataframes with unequal number of columns. I would like to subtract intensity values, column-wise (i.e, sample-wise), of rows of df2 from df1. My conditions are:

  1. In df1 there are multiple rows for peptide sequence (pep_seq) and their corresponding intensities per sample (int_sam) for every gene (gene_nm). The same gene appears multiple times, i.e, occupies several rows.
  2. In df2 the genes (rows) appears only once with its corresponding intensity values
  3. Therefore, df1 is much longer than df2 (e.g., 55000 rows vs 6000 rows)
  4. The number of intensity columns (int_samp) can be many. I have 3 in this example

Dataframe 1

pep_seq = c("aaaaaaaaa", "ababababba", "dfsfsfsfds", "xbbcbcncncc", "fbbdsgffhhh", "dggdgdgegeggerr", 
        "dfgthrgfgf", "wegregegg", "egegegergewge", "sfngegebser", "qegqeefbew", "qegqetegqt", 
        "qwtqtewr", "etghsfrgf", "sfsdfbdfbergeagaegr", "wasfqertsdfaefwe")
int_samp_1 = c("2421432", "24242424", "NA", "4684757849", "NA", "10485040", "NA", 
          "6849400", "40300", "NA", "NA", "NA", "556456466", "4646456466", "246464266", "4564242646")
int_samp_2 = c("NA", "5342353", "14532556", "43566", "46367367", "768769769", "797899", "NA", "NA", "NA", 
          "686899", "7898979", "678568", "NA", "68886", "488")
int_samp_3 = c("11351", "NA", "NA", "NA", "1354151345", "1351351354", "314534", "1535", "3145354", "4353455", 
          "324535", "3543445", "34535", "34535534", "NA", "NA")
gene_nm = c("A", "A", "A", "A", "A", "A", "B", "B", "B", "C", "C", "C", "C", "C", "C", "C")
df_1 = cbind.data.frame(pep_seq, int_samp_1, int_samp_2, int_samp_3, gene_nm)

Dataframe 2

int_samp_1a = c("2421432", "24242424", "NA")
int_samp_2a = c("NA", "5342353", "14532556")
int_samp_3a = c("11351", "NA", "NA")
gene_nm.a = c("A", "B", "C")
df_2 = cbind.data.frame(gene_nm.a, int_samp_1a, int_samp_2a, int_samp_3a)

Please suggest.

andrew_reece
  • 20,390
  • 3
  • 33
  • 58
ip2018
  • 655
  • 1
  • 7
  • 14
  • 1
    You have described your data nicely. Now what is your expected output in this case? – MKR Jun 02 '18 at 20:05
  • 1
    How do you want to treat `NA`? For example, if there is an `NA` in `df_2` but not in `df_1`, should the value in `df_1` stay the same, as if we were subtracting zero? What about if there is a value in `df_2` but `NA` in `df_1`? Should that end up a negative number, or stay `NA`? – andrew_reece Jun 02 '18 at 20:38
  • One thing really strange in data is that all measurements has been defined in `factor`. I don't think it was needed when eventually one need to perform subtraction on those. – MKR Jun 02 '18 at 20:46
  • @MKR........... Sorry I did not mentioned my required output. This can be either separate columns with the subtracted values or a separate dataframe. Both will be good. – ip2018 Jun 02 '18 at 22:10
  • @andrew_reece.. I can remove NAs from the dataframes depending on the situation. In this case would like to treat them as "0". So yes you are right. Thanks. – ip2018 Jun 02 '18 at 22:14
  • @IndranilPaul Have a look at answers provided below. Give me your feedback if that doesn't fit your expectations OR let me know if you cannot understand anything. – MKR Jun 02 '18 at 22:15
  • @Indranil Paul I updated my answer to convert `NA` to `0` before computing the differences. Is that what you had in mind? – andrew_reece Jun 02 '18 at 22:31

2 Answers2

2

One option can be to join df_1 and df_2 using dplyr and then perform simple matrix subtraction.

Note: The data frames got intensity readings in factor. I thought its not good idea to keep measurements in factor when you expect to perform subtraction. Hence I had converted those to integer.

library(dplyr)

# The NA values from df_2 has been changed to 0 since keeping those NA, will
# turn values in df_A NA for no reason. 
mod <- df_1 %>% left_join(df_2, by= c("gene_nm" = "gene_nm.a")) %>% # join on gene
  mutate_at(vars(starts_with("int_samp")), funs(as.integer(as.character(.)))) %>%
  mutate_at(vars(ends_with("a")), funs(ifelse(is.na(.),0L,.))) #Values are converted

# The modified data.frame got columns from both df_1 and df_2
mod[,grepl("^int_samp_\\d+$", names(mod))] <- 
                mod[,grepl("^int_samp_\\d+$", names(mod))] -  
                mod[,grepl("^int_samp_\\d+[a-z]+$", names(mod))]

# Take columns from df_1. 
mod[names(df_1)]
#                pep_seq int_samp_1 int_samp_2 int_samp_3 gene_nm
# 1            aaaaaaaaa          0         NA          0       A
# 2           ababababba   21820992    5342353         NA       A
# 3           dfsfsfsfds         NA   14532556         NA       A
# 4          xbbcbcncncc         NA      43566         NA       A
# 5          fbbdsgffhhh         NA   46367367 1354139994       A
# 6      dggdgdgegeggerr    8063608  768769769 1351340003       A
# 7           dfgthrgfgf         NA   -4544454     314534       B
# 8            wegregegg  -17393024         NA       1535       B
# 9        egegegergewge  -24202124         NA    3145354       B
# 10         sfngegebser         NA         NA    4353455       C
# 11          qegqeefbew         NA  -13845657     324535       C
# 12          qegqetegqt         NA   -6633577    3543445       C
# 13            qwtqtewr  556456466  -13853988      34535       C
# 14           etghsfrgf         NA         NA   34535534       C
# 15 sfsdfbdfbergeagaegr  246464266  -14463670         NA       C
# 16    wasfqertsdfaefwe         NA  -14532068         NA       C
MKR
  • 19,739
  • 4
  • 23
  • 33
  • I am getting the following error. Error in (function (classes, fdef, mtable) : unable to find an inherited method for function ‘funs’ for signature ‘"numeric"’. Getting the same error with both the solutions. – ip2018 Jun 22 '18 at 18:25
  • Can you please check on which line you are getting error? You can select up to just before `%>%` and execute. You can find which line you are getting the error in that way. – MKR Jun 22 '18 at 18:35
  • Thanks. Seems the problem is with `funs(`. Can I remove the `funs(ifelse(is.na(.),0L,.))` from the code. I can do this operation on the dataframe outside – ip2018 Jun 22 '18 at 18:42
  • @ip2018 I think those column should have been converted to `Numeric` in `df2_2` at the first place. – MKR Jun 22 '18 at 18:45
  • Let me have a look. I will get back in about couple of hours. – MKR Jun 22 '18 at 19:02
  • No problem. Thanks – ip2018 Jun 22 '18 at 19:10
2

IIUC, you have columns of the same name in df_1 and df_2 (eg int_samp_X for some integer X), and you want to get the difference of matching column names, grouped by gene_nm (eg df_1[df_1$gene_nm == 'A', int_samp_1] - df_2[df_2$gene_nm == 'A', int_samp_1]).

We can use the tidyverse family of packages to solve this problem, notably, dplyr and purrr.

First, merge df_1 and df_2 with a left_join, to ensure all the many entries in df_1 are preserved when matched against the gene-level entries in df_2:

library(tidyverse)

df_3 <- df_1 %>% left_join(df_2, by = "gene_nm")

df_3
               pep_seq int_samp_1.x int_samp_2.x int_samp_3.x gene_nm int_samp_1.y int_samp_2.y int_samp_3.y
1            aaaaaaaaa      2421432           NA        11351       A      2421432           NA        11351
2           ababababba     24242424      5342353           NA       A      2421432           NA        11351
3           dfsfsfsfds           NA     14532556           NA       A      2421432           NA        11351
4          xbbcbcncncc   4684757849        43566           NA       A      2421432           NA        11351
5          fbbdsgffhhh           NA     46367367   1354151345       A      2421432           NA        11351
6      dggdgdgegeggerr     10485040    768769769   1351351354       A      2421432           NA        11351
7           dfgthrgfgf           NA       797899       314534       B     24242424      5342353           NA
8            wegregegg      6849400           NA         1535       B     24242424      5342353           NA
9        egegegergewge        40300           NA      3145354       B     24242424      5342353           NA
10         sfngegebser           NA           NA      4353455       C           NA     14532556           NA
11          qegqeefbew           NA       686899       324535       C           NA     14532556           NA
12          qegqetegqt           NA      7898979      3543445       C           NA     14532556           NA
13            qwtqtewr    556456466       678568        34535       C           NA     14532556           NA
14           etghsfrgf   4646456466           NA     34535534       C           NA     14532556           NA
15 sfsdfbdfbergeagaegr    246464266        68886           NA       C           NA     14532556           NA
16    wasfqertsdfaefwe   4564242646          488           NA       C           NA     14532556           NA

Then map over the column names of interest, taking the difference from each column-pair. (Note that you'll need to convert the int_samp columns from factor to numeric first.)

Update (per OP comments): To convert NA to 0 before computing differences, we can use mutate_if() and replace(), adding the following to the method chain:

mutate_if(is.numeric,  funs(replace(., is.na(.), 0)))

Finally, join back to df_1:

var_names <- df_1 %>% select(starts_with("int_samp")) %>% names()

var_names # [1] "int_samp_1" "int_samp_2" "int_samp_3"

var_names %>%
  map_dfc(~df_3 %>%
            mutate_at(vars(matches(.x)), funs(as.numeric(as.character(.)))) %>%
            mutate_if(is.numeric,  funs(replace(., is.na(.), 0))) %>%
            select(matches(.x)) %>%
            reduce(`-`)) %>%
  set_names(paste0(var_names, "_diff")) %>%
  bind_cols(df_1)

Output:

   int_samp_1_diff int_samp_2_diff int_samp_3_diff pep_seq             int_samp_1 int_samp_2 int_samp_3 gene_nm
             <dbl>           <dbl>           <dbl> <fct>               <fct>      <fct>      <fct>      <fct>  
 1              0.              0.              0. aaaaaaaaa           2421432    NA         11351      A      
 2       21820992.        5342353.         -11351. ababababba          24242424   5342353    NA         A      
 3       -2421432.       14532556.         -11351. dfsfsfsfds          NA         14532556   NA         A      
 4     4682336417.          43566.         -11351. xbbcbcncncc         4684757849 43566      NA         A      
 5       -2421432.       46367367.     1354139994. fbbdsgffhhh         NA         46367367   1354151345 A      
 6        8063608.      768769769.     1351340003. dggdgdgegeggerr     10485040   768769769  1351351354 A      
 7      -24242424.       -4544454.         314534. dfgthrgfgf          NA         797899     314534     B      
 8      -17393024.       -5342353.           1535. wegregegg           6849400    NA         1535       B      
 9      -24202124.       -5342353.        3145354. egegegergewge       40300      NA         3145354    B      
10              0.      -14532556.        4353455. sfngegebser         NA         NA         4353455    C      
11              0.      -13845657.         324535. qegqeefbew          NA         686899     324535     C      
12              0.       -6633577.        3543445. qegqetegqt          NA         7898979    3543445    C      
13      556456466.      -13853988.          34535. qwtqtewr            556456466  678568     34535      C      
14     4646456466.      -14532556.       34535534. etghsfrgf           4646456466 NA         34535534   C      
15      246464266.      -14463670.              0. sfsdfbdfbergeagaegr 246464266  68886      NA         C      
16     4564242646.      -14532068.              0. wasfqertsdfaefwe    4564242646 488        NA         C  

Note: This answer is largely a derivation from akrun's answer here.

andrew_reece
  • 20,390
  • 3
  • 33
  • 58
  • I'm just wondering why our results are not matching. Logic is quite similar. Few values are matching as well. – MKR Jun 02 '18 at 21:06
  • The results do not match. I am trying to understand why is that? – ip2018 Jun 02 '18 at 22:31
  • Just spot checking a few of the rows, the math looks good here. For example, row 6, `int_samp_1`: `10485040 - 2421432 = 8063608` – andrew_reece Jun 02 '18 at 22:33
  • @MKR - just looking at the values, ours seem to line up. Which ones don't? – andrew_reece Jun 02 '18 at 22:35
  • 1
    @andrew_reece I was looking for `int_samp_1_diff ` column. The `3rd, 4th and 5th` row are not matching. Perhaps, because I have kept the value as `NA` if that value was `NA` in `df_1`. This suggests our solutions are align. :-) – MKR Jun 02 '18 at 22:43
  • 1
    @IndranilPaul Perhaps these answers would have helped you. If so you can accept an answer and it would be better to have a look at this link https://stackoverflow.com/help/someone-answers – MKR Jun 02 '18 at 22:47
  • 1
    Thanks to both of you. You guys are awesome! I am accepting the solution from andrew_reece, mainly because I can understand it relatively easily. – ip2018 Jun 02 '18 at 22:51
  • @IndranilPaul Glad it helped you. No worries. – MKR Jun 02 '18 at 22:55
  • @andrew_reece... I am getting the following error. Error in (function (classes, fdef, mtable) : unable to find an inherited method for function ‘funs’ for signature ‘"numeric"’. Getting the same error with both the solutions – ip2018 Jun 22 '18 at 20:34