1

Let's say I have this dataframe:

> df <- data.frame(a1=c(1, 2, 3, 4, 5), b1=c(5, 4, 3, 2, 1), b2=c(10, 11, 12, 13, 14), a2=c(5, 6, 7, 8, 9), c1=c(100, 200, 300, 400, 500))
> df
  a1 b1 b2 a2  c1
1  1  5 10  5 100
2  2  4 11  6 200
3  3  3 12  7 300
4  4  2 13  8 400
5  5  1 14  9 500
> 

As you can see, I have columns a1 a2 that both start with a, I want to sum them row-wise and generate just one column of a, same for column b. Also for c column, just the c column should stay the same as it is since there are only one column that starts with c.

Desired output:

    a   b    c
1   6  15  100
2   8  15  200
3  10  15  300
4  12  15  400
5  14  15  500

Coming from a Python, in (Python Library), I could just simply use:

df.groupby(df.columns.str[:1], axis=1).sum()

But I am not sure how to do this in R.

I tried dplyr:

df %>% group_by(sub(".$", "", colnames(df))) %>%
  mutate(across(colnames(.), sum))

But an error popped up. New to R.

U13-Forward
  • 69,221
  • 14
  • 89
  • 114
  • 2
    `group_by` is for rows, not columns. Possible duplicate or related post: https://stackoverflow.com/q/29006056/680068 – zx8754 Oct 13 '21 at 08:35
  • 1
    Got it: `sapply(unique(sub(".$", "", colnames(df))), function(x) rowSums(df[startsWith(colnames(df), x)]))` – U13-Forward Oct 13 '21 at 08:41
  • 1
    If your answer in the comments doesn't already exist in linked post, please post there. – zx8754 Oct 13 '21 at 09:29
  • 2
    @zx8754 Added [Efficiently sum across multiple columns in R](https://stackoverflow.com/a/69553014) – U13-Forward Oct 13 '21 at 09:32

1 Answers1

2
as.data.frame(lapply(unique(sapply(names(df), function(x) grep(substr(x,1,1), names(df)))), function(y) if(is.na(y[2])) df[[y[1]]] else df[[y[1]]] + df[[y[2]]]))
  c.6..8..10..12..14. c.15..15..15..15..15. c.100..200..300..400..500.
1                   6                    15                        100
2                   8                    15                        200
3                  10                    15                        300
4                  12                    15                        400
5                  14                    15                        500

You can then set names using:

unique(substr(names(df),1,1))
[1] "a" "b" "c"

As commented:

sapply(unique(sub(".$", "", colnames(df))), function(x) rowSums(df[startsWith(colnames(df), x)]))
U13-Forward
  • 69,221
  • 14
  • 89
  • 114
Karthik S
  • 11,348
  • 2
  • 11
  • 25