create a new column which is the sum of specific columns (selected by their names) in dplyr

Question

My question is how to create a new column which is the sum of some specific columns (selected by their names) in dplyr. For example, with iris dataset, I create a new columns called Petal, which is the sum of Petal.Length and Petal.Width.

iris %>% mutate(Petal = Petal.Length+Petal.Width)

Now imagine I have a dataset with 20 columns with 'Petal' in their names. I want to create a column 'Petal' which sum up all those columns. I definitely do not want to type all the columns names in my code. Feel like there should be achievable with one line of code in dplyr. Appreciate if anyone can help.

Sounds like your data are not in a tidy format. Better to `gather()` to long format and `summarize()`. There are many existing questions out there on "reshaping wide to long". — MrFlick, Dec 11 '17 at 18:46

score 25 · Answer 1 · answered Dec 11 '17 at 18:53

I agree with MrFlick that tidying your data is preferable---especially if you want to do anything other than sum these columns---but here is one option:

iris %>% mutate(
   Petal = rowSums(select(., starts_with("Petal")))
)

    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species Petal
1            5.1         3.5          1.4         0.2     setosa   1.6
2            4.9         3.0          1.4         0.2     setosa   1.6
3            4.7         3.2          1.3         0.2     setosa   1.5
4            4.6         3.1          1.5         0.2     setosa   1.7
5            5.0         3.6          1.4         0.2     setosa   1.6
...

See ?select helper options other than starts_with() for selecting columns.

skd · Answer 2 · 2019-02-05T14:04:11.693

5

Using reduce() from purrr is slightly faster than rowSums and definately faster than apply, since you avoid iterating over all the rows and just take advantage of the vectorized operations:

library(purrr)
library(dplyr)
iris %>% mutate(Petal = reduce(select(., starts_with("Petal")), `+`))

See this for timings

edited Feb 05 '19 at 14:04

answered Feb 05 '19 at 04:14

skd

1,865
1
21
29

Evan Hegarty · Answer 3 · 2020-01-03T19:58:15.290

2

iris %>% mutate(Petal = rowSums(.[grep("Petal", names(.))], na.rm = T))

This should work.

names(.) selects the names from your dataframe, grep searches through these to find ones that match a regex ("Petal"), and rowSums adds the value of each column, assigning them to your new variable Petal. The ".[]" syntax is a work-around for the way that dplyr passes column names.

edited Jan 03 '20 at 19:58

answered Jan 03 '20 at 19:02

Evan Hegarty

31
5

LMc · Answer 4 · 2023-04-24T15:22:04.303

I gave a similar answer here and here. rowSums is the best option if your aggregating function is sum:

iris %>% 
  mutate(Petal = rowSums(pick(starts_with("Petal"))))

But you can use c_across and rowwise:

iris %>% 
  rowwise() %>% 
  mutate(Petal = sum(c_across(starts_with("Petal")))) %>% 
  ungroup()

The big advantage is that you can use other functions besides sum. rowSums is a better option because it's faster, but if you want to apply another function other than sum this is a good option.

You can use any of the tidyselect options within c_across and pick to select columns by their name, position, class, a range of consecutive columns, etc.

Output

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species Petal
          <dbl>       <dbl>        <dbl>       <dbl> <fct>   <dbl>
 1          5.1         3.5          1.4         0.2 setosa    1.6
 2          4.9         3            1.4         0.2 setosa    1.6
 3          4.7         3.2          1.3         0.2 setosa    1.5
 4          4.6         3.1          1.5         0.2 setosa    1.7
 5          5           3.6          1.4         0.2 setosa    1.6
 6          5.4         3.9          1.7         0.4 setosa    2.1
 7          4.6         3.4          1.4         0.3 setosa    1.7
 8          5           3.4          1.5         0.2 setosa    1.7
 9          4.4         2.9          1.4         0.2 setosa    1.6
10          4.9         3.1          1.5         0.1 setosa    1.6
# ... with 140 more rows

score 1 · Answer 5 · answered Dec 11 '17 at 19:18

More generally, create a key for each observation (e.g., the row number using mutate below), move the columns of interest into two columns, one holds the column name, the other holds the value (using melt below), group_by observation, and do whatever calculations you want. Below, I add a column using mutate that sums all columns containing the word 'Petal' and finally drop whatever variables I don't want (using select).

require(dplyr)
require(reshape2)

iris %>%
  mutate(observation = 1:nrow(iris)) %>% 
  melt(
    measure.vars = grep("Petal", colnames(iris)), 
    variable.name = 'variable', 
    value.name = 'value') %>% 
  group_by(observation) %>%
  mutate(Petal = sum(value)) %>% 
  select(-observation, -variable, -value)

While the above can be shortened, I thought this version would provide some guidance.

create a new column which is the sum of specific columns (selected by their names) in dplyr

5 Answers5

Linked

Related