0

I am new to R, and so would greatly appreciate more explanation for any code you might have to help solve my issue.

I have a data.frame with groups of columns related to each other and I want to perform a calculation on each of those many groups to get new output columns. For example, many biological replicates in an experiment where I want to perform the calculations on each replicate independently before collapsing them.

I know I could use mutate in dplyr to create new columns, but I am not sure how to do this in a loop or how to use lapply type strategy to avoid re-tying the columns names each time. My biggest issue is understanding how to convert the column names into something usable by one of these strategies.

For example:

> A.1 <- c(11,12,13,4,15,6,17,18)
> A.2 <- c(2,4,5,5,19,7,5,1)
> 
> B.1 <- c (3,4,5,1,31,76,13,70)
> B.2 <- c (10,9,8,15,31,12,13,12)
> 
> C.1 <- c(1,2,3,4,5,6,7,8)
> C.2 <- c(2,4,5,8,10,12,15,18)
> 
> df <- data.frame(A.1, A.2, B.1, B.2, C.1, C.2)
>
> df 
A.1 A.2 B.1 B.2 C.1 C.2
1  11   2   3  10   1   2
2  12   4   4   9   2   4
3  13   5   5   8   3   5
4   4   5   1  15   4   8
5  15  19  31  31   5  10
6   6   7  76  12   6  12
7  17   5  13  13   7  15
8  18   1  70  12   8  18
> 

Where I want to create new columns where A.new = A.1/A.2, B.new = B.1/B.2 etc. without typing out each column name explicitly. Also note the "A" and "B" are really character strings, so typing all of them would be very messy and time consuming.

Something like this, but a general case for many column groups:

> df <- df %>% mutate(A.new = A.1/A.2)
> df <- df %>% mutate(B.new = B.1/B.2)
> df <- df %>% mutate(C.new = C.1/C.2)
> 
> df
 A.1 A.2 B.1 B.2 C.1 C.2      A.new      B.new     C.new
1  11   2   3  10   1   2  5.5000000 0.30000000 0.5000000
2  12   4   4   9   2   4  3.0000000 0.44444444 0.5000000
3  13   5   5   8   3   5  2.6000000 0.62500000 0.6000000
4   4   5   1  15   4   8  0.8000000 0.06666667 0.5000000
5  15  19  31  31   5  10  0.7894737 1.00000000 0.5000000
6   6   7  76  12   6  12  0.8571429 6.33333333 0.5000000
7  17   5  13  13   7  15  3.4000000 1.00000000 0.4666667
8  18   1  70  12   8  18 18.0000000 5.83333333 0.4444444
> 

I don't see the answer to my question on here already, but if you could point me to existing answers that would be much appreciated! I am currently thinking about the column names as containing the variable, but maybe that is not the correct way to approach this (R is also the first programming language I am learning) and so my searches for answers haven't yielded much yet.

Thank you for your guidance in advance!

Sarah
  • 1
  • If you want to use `dplyr`, it's tricky; [this question](https://stackoverflow.com/questions/26003574/use-dynamic-variable-names-in-dplyr) and [this question](https://stackoverflow.com/questions/39209987/using-functions-of-multiple-columns-in-a-dplyr-mutate-at-call) are good places to start. But I suspect things would be a lot easier if you had your data in a longer, tidy format: have a `group` column that contains "A", "B", "C", etc. and a `replicate` column with "1", "2", etc. – A. S. K. Jun 03 '20 at 02:30
  • 1
    Does this answer your question? [Using functions of multiple columns in a dplyr mutate\_at call](https://stackoverflow.com/questions/39209987/using-functions-of-multiple-columns-in-a-dplyr-mutate-at-call) – A. S. K. Jun 03 '20 at 02:30

2 Answers2

0

We can do this much more easily with split.default

lst1 <- lapply(split.default(df, sub("\\.\\d+$", "", names(df))), 
        function(x) x[[1]]/x[[2]])
df[paste0(names(lst1), ".new")] <- lst1
df
#  A.1 A.2 B.1 B.2 C.1 C.2      A.new      B.new     C.new
#1  11   2   3  10   1   2  5.5000000 0.30000000 0.5000000
#2  12   4   4   9   2   4  3.0000000 0.44444444 0.5000000
#3  13   5   5   8   3   5  2.6000000 0.62500000 0.6000000
#4   4   5   1  15   4   8  0.8000000 0.06666667 0.5000000
#5  15  19  31  31   5  10  0.7894737 1.00000000 0.5000000
#6   6   7  76  12   6  12  0.8571429 6.33333333 0.5000000
#7  17   5  13  13   7  15  3.4000000 1.00000000 0.4666667
#8  18   1  70  12   8  18 18.0000000 5.83333333 0.4444444

NOTE: We don't need any packages and can be done much easily

data

df <- structure(list(A.1 = c(11, 12, 13, 4, 15, 6, 17, 18), A.2 = c(2, 
4, 5, 5, 19, 7, 5, 1), B.1 = c(3, 4, 5, 1, 31, 76, 13, 70), B.2 = c(10, 
9, 8, 15, 31, 12, 13, 12), C.1 = c(1, 2, 3, 4, 5, 6, 7, 8), C.2 = c(2, 
4, 5, 8, 10, 12, 15, 18)), class = "data.frame", row.names = c(NA, 
-8L))
Community
  • 1
  • 1
akrun
  • 874,273
  • 37
  • 540
  • 662
0

As mentioned by @A. S. K. it is easier to do the calculation if you have data in a long format.

We can use pivot_longer to get data in long format and for every row divide the first value by second value for that group of column.

library(dplyr)

df %>%
  mutate(row = row_number()) %>%
  tidyr::pivot_longer(cols = -row, 
                      names_to = c('.value', 'group'), 
                      names_sep = '\\.') %>%
   group_by(row) %>%
   summarise(across(A:C, list(new = ~.[1]/.[2]))) %>%
   #If you have an older version of dplyr use
   #summarise_at(vars(A:C), list(new = ~.[1]/.[2])) %>%
   select(-row) %>%
   bind_cols(df, .)


#  A.1 A.2 B.1 B.2 C.1 C.2  A_new  B_new C_new
#1  11   2   3  10   1   2  5.500 0.3000 0.500
#2  12   4   4   9   2   4  3.000 0.4444 0.500
#3  13   5   5   8   3   5  2.600 0.6250 0.600
#4   4   5   1  15   4   8  0.800 0.0667 0.500
#5  15  19  31  31   5  10  0.789 1.0000 0.500
#6   6   7  76  12   6  12  0.857 6.3333 0.500
#7  17   5  13  13   7  15  3.400 1.0000 0.467
#8  18   1  70  12   8  18 18.000 5.8333 0.444

You can specify range of column names using A:C in summarise step. Also, note that in pivot_longer step the names_sep argument is used to differentiate column group. Since you have column names as A.1, A.2 I use '.' as a separator, you might need to change it according to the column names that you have.

Ronak Shah
  • 377,200
  • 20
  • 156
  • 213