R - dplyr across: subtracting 1 column from a range of columns based on index

Question

This is my first post, and I am relatively new to R, so apologies if I have framed this poorly.

I have not found this problem described anywhere else but the initial approach is somewhat similar to that decribed here:

How to mutate several columns by column index rather than column name using across?.

I have a data frame containing time series data, where I would like to remove specific columns from a range of continous columns. In the example below, the values in 1R would be removed from columns 1A, 1B and 1C. Likewise the values in 2R would be removed from 2A, 2B and 2C.

So a dataframe like this


t | 1A | 1B| 1C|1RMV| 2A | 2B| 2C|2RMV| 
- | - -|- -|- -| - -| - -|- -|- -|- - | 
1 | 1  | 4 | 7 | 3  | 1  | 4 | 7 | 1  |   . . . . . . .
2 | 2  | 5 | 8 | 2  | 2  | 5 | 8 | 2  |
3 | 3  | 6 | 9 | 1  | 3  | 6 | 9 | 3  |

Would become this

t | 1A | 1B| 1C|1RMV| 2A | 2B| 2C|2RMV| 
 -| - -|- -|- -| - -| - -|- -|- -|- - | 
1 | -2 | 1 | 4 | 3  | 0  | 3 | 6 | 1  |   . . . . . . .
2 | 0  | 3 | 6 | 2  | 0  | 3 | 6 | 2  |
3 | 2  | 5 | 8 | 1  | 0  | 3 | 6 | 3  |

I have previously performed this 'manually' and it works just fine, however since trying to make this process more automatic I am running into problems.

As the number of columns in each group (1A,1B,1C whereas 2A,2B,2C,2D,2E etc.) is different I initially create a list with index positions of all the columns which I would like to subtract from the others like so:

#Return TRUE only for columns to be removed
df_boolean <- str_ends(colnames(df), "RMV")


#Create a 1D vector with elements of index positions of columns to be removed in Data
col_number <- ncol(Intensity_Raw_Data)
remove_indices <- c()
for(i in 1:col_number){
  if(df_boolean[i] == TRUE){
    remove_indices <- c(background_indices, i)
  }
}

Then I perform the subtraction using across from dplyr like so:


group_number <- length(remove_indices)


#Calculate subtraction for first group, probably way to do it in one loop but first column is the time column and I'm lazy

df_Subtracted <- df %>%
mutate(across(2:(remove_indices[1] - 1), ~.  - df[(remove_indices[1])]))

#Calculate subtracction for remaining groups
for(i in 2:group_number){
  df_Subtracted <- df_Subtracted %>%
  mutate(across((remove_indices[i-1] + 1):(remove_indices[i] - 1), ~.x - df[(remove_indices[i])]))

Here I run into my problem, when running this manually (i.e. manually typing out column names in across() ), the names of the columns remains the same. However when I run this using the code above column names are renamed as such:

1A$1R 1B$1R 1C$1R . . . . 2A$2R 2B$2R 2C$2R 2D$2R. . . . . .

While the output in View() appears correct using str() reveals that each column in the output (df_Subtracted) is in fact a 1 variable data frame.

I am not sure what is causing this to happen, However I think it may be to do with how I am indexing the column to be removed in across. Any help would be appreciated !

**

- UPDATE

**

I modified GuedesBF anwser slightly by using the approach used by Akrun in this post to make a generalised anwser for data divided by column name.


df_subtracted_split <- df %>%
  split.default(sub('\\d+', '', names(df))) %>%
  lapply(function(x) {names(x)[ncol(x)] <- "RMV";x}) %>%
  map(~mutate(.x, across(1:last_col(1), ~.x - RMV))) 
 
 
df_subtracted <- do.call(qpcR:::cbind.na, Data_Final)

For some reason list_rbind/list_cbind resulted in dropping off
columns, I read here that it is probably a result of some groups in my data frame having missing rows, thus I used cbind.na from qpcR instead.

Thanks GuedesBF and peter861222!

GuedesBF · Accepted Answer · 2023-05-16T16:42:11.877

0

This gets easier if we split.default() the data.frame into a list of similar data.frames, do the necessary operations, and finally bind the list back into a single data.frame

library(dplyr)
library(readr)
library(purrr)

df %>%
    select(-t) %>% 
    split.default(parse_number(names(.)) %>%
    map(~mutate(.x, across(c(2A, 2B, 2C), \(x) x - cur_data[[4]])) %>%
    list_rbind()

edited May 16 '23 at 16:42

answered May 16 '23 at 16:35

GuedesBF

8,409
5
19
37

Hi, thanks for your comment! Perhaps I should have clarified, In my data there is not an equal amount of variables for each grouping - i.e. 1A 2B 3C, and perhaps, 2A 2B 3C 4D 5E - therefore I cannot just split the data based on the group number as this will unevenly split the data and result in the wrong columns being subtracted. – engpol May 16 '23 at 17:14

score 0 · Answer 2 · answered May 16 '23 at 18:23

0

This should work:

pivot_longer(df,cols=-1,names_pattern="(\\d)(\\w+)",names_to=c("id","name")) %>%
  mutate(value=case_when(name=="RMV"~value,
                         T~value-value[name=="RMV"]),.by=c("t","id")) %>%
  pivot_wider(names_from= c("id","name"),names_sep="") 

      t  `1A`  `1B`  `1C` `1RMV`  `2A`  `2B`  `2C` `2RMV`
  <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>  <dbl>
1     1    -2     1     4      3     0     3     6      1
2     2     0     3     6      2     0     3     6      2
3     3     2     5     8      1     0     3     6      3

answered May 16 '23 at 18:23

one

3,121
1
4
24

Thanks a lot peter!, I feel like this will probably work but I am running into an error which I am struggling to solve. Running your code generates the error code: "Error in `mutate()`: ! Problem while computing `.by = c("time", "id")`. `.by` must be size 98728 or 1, not 2." I have declared time as a factor so this is not the issue, and the output of pivot_longer works fine. – engpol May 17 '23 at 14:51
Also, I am having trouble understanding parts of the mutate call. What does "T~" denote? I have looked through case_when and mutate documentation and cannot find "T" as an argument that is interpretable, is this a typo? – engpol May 17 '23 at 17:36
```.by = c("time", "id")``` tries to group column ```t``` and ```id``` (which is the 1 and 2 in 1A, 2A. etc). This allows you to do subtraction in the same group. Maybe you want to check if there is only one ```name=="RMV"``` in each subgroup – one May 17 '23 at 18:11
```T``` means ```TRUE```, it is a bad habit from me. You can switch it to ``TRUE```. – one May 17 '23 at 18:17
There is definitely only one RMV for each group, and I'm still getting the error. I have had to flip the naming convention - "A1, A2 etc." - I initially outlined as some groups possess many hundreds of columns, so was running out of letters. I changed the names_pattern accordingly - "(\\w)(\\d+|RMV)" and the output of pivot_longer still seems fine so am not sure what is causing the issue. – engpol May 17 '23 at 20:24

R - dplyr across: subtracting 1 column from a range of columns based on index

2 Answers2