How would I group by columns with equivalent prefix and sum them?

Question

Let's say I have this dataframe:

> df <- data.frame(a1=c(1, 2, 3, 4, 5), b1=c(5, 4, 3, 2, 1), b2=c(10, 11, 12, 13, 14), a2=c(5, 6, 7, 8, 9), c1=c(100, 200, 300, 400, 500))
> df
  a1 b1 b2 a2  c1
1  1  5 10  5 100
2  2  4 11  6 200
3  3  3 12  7 300
4  4  2 13  8 400
5  5  1 14  9 500
>

As you can see, I have columns a1 a2 that both start with a, I want to sum them row-wise and generate just one column of a, same for column b. Also for c column, just the c column should stay the same as it is since there are only one column that starts with c.

Desired output:

    a   b    c
1   6  15  100
2   8  15  200
3  10  15  300
4  12  15  400
5  14  15  500

Coming from a Python, in pandas (Python Library), I could just simply use:

df.groupby(df.columns.str[:1], axis=1).sum()

But I am not sure how to do this in R.

I tried dplyr:

df %>% group_by(sub(".$", "", colnames(df))) %>%
  mutate(across(colnames(.), sum))

But an error popped up. New to R.

`group_by` is for rows, not columns. Possible duplicate or related post: https://stackoverflow.com/q/29006056/680068 — zx8754, Oct 13 '21 at 08:35
Got it: `sapply(unique(sub(".$", "", colnames(df))), function(x) rowSums(df[startsWith(colnames(df), x)]))` — U13-Forward, Oct 13 '21 at 08:41
If your answer in the comments doesn't already exist in linked post, please post there. — zx8754, Oct 13 '21 at 09:29
@zx8754 Added [Efficiently sum across multiple columns in R](https://stackoverflow.com/a/69553014) — U13-Forward, Oct 13 '21 at 09:32

score 2 · Accepted Answer · edited Oct 13 '21 at 09:16

2

as.data.frame(lapply(unique(sapply(names(df), function(x) grep(substr(x,1,1), names(df)))), function(y) if(is.na(y[2])) df[[y[1]]] else df[[y[1]]] + df[[y[2]]]))
  c.6..8..10..12..14. c.15..15..15..15..15. c.100..200..300..400..500.
1                   6                    15                        100
2                   8                    15                        200
3                  10                    15                        300
4                  12                    15                        400
5                  14                    15                        500

You can then set names using:

unique(substr(names(df),1,1))
[1] "a" "b" "c"

As commented:

sapply(unique(sub(".$", "", colnames(df))), function(x) rowSums(df[startsWith(colnames(df), x)]))

edited Oct 13 '21 at 09:16

U13-Forward

69,221
14
89
114

answered Oct 13 '21 at 09:05

Karthik S

11,348
2
11
25

1

Thanks I already got an answer though, `sapply(unique(sub(".$", "", colnames(df))), function(x) rowSums(df[startsWith(colnames(df), x)]))`. – U13-Forward Oct 13 '21 at 09:10
1

@U12-Forward, yes saw the answer but wanted to try a different approach. – Karthik S Oct 13 '21 at 09:12
True! It was a good question. I have a similar code optimization to do thanks to this answer. – Francisco Maria Calisto Oct 13 '21 at 09:23
1

@U12-Forward Because KarthikS was already typing up the answer before I closed the post. – zx8754 Oct 13 '21 at 09:28

How would I group by columns with equivalent prefix and sum them?

1 Answers1