Sum rows in data.frame or matrix

Question

I have a very large dataframe with rows as observations and columns as genetic markers. I would like to create a new column that contains the sum of a select number of columns for each observation using R.

If I have 200 columns and 100 rows, then I would like to create a new column that has 100 rows with the sum of, say, columns 43 through 167. The columns contain either 1 or 0. With the new column that contains the sum of each row, I will be able to sort the individuals by who has the most genetic markers.

I feel it is something close to:

data$new=sum(data$[,43:167])

score 141 · Answer 1 · answered Oct 21 '10 at 21:08

You can use rowSums.

rowSums(data) should give you what you want.

Greg

And for OP problem `data$new <- rowSums(data[43:167])` – Marek Oct 21 '10 at 21:14
To save someone's time, perhaps: avoid confusion with function `rowsum` which does something else! – Augustin Jan 01 '16 at 12:27
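The difference between the two similarly named functions can be seen on a toy matrix. This is only a minimal sketch: the matrix values and the group labels are made up for illustration.

```r
# Toy 3 x 2 matrix; the values and group labels are made up for illustration.
m <- matrix(1:6, ncol = 2)

# rowSums(): one total per row.
per_row <- rowSums(m)

# rowsum(): column totals within each group of rows.
per_group <- rowsum(m, group = c("a", "a", "b"))

print(per_row)
print(per_group)
```

Here `rowSums` returns one value per row, while `rowsum` collapses the rows that share a group label and keeps the columns separate.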

score 49 · Answer 2 · answered Oct 21 '10 at 21:17

The rowSums function (as Greg mentions) will do what you want, but you are mixing subsetting techniques in your attempt: do not use "$" together with "[]". Your code should look more like:

data$new <- rowSums( data[,43:167] )

If you want to use a function other than sum, see ?apply for applying general functions across rows or columns.
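As a rough end-to-end sketch of the workflow in the question, the following uses a small made-up data frame (6 individuals, 10 markers, columns 3 through 8 standing in for 43 through 167). The object names `markers`, `ranked` and `via_apply` are hypothetical.

```r
set.seed(1)  # arbitrary seed so the toy data is reproducible

# Hypothetical toy data: 6 individuals x 10 markers coded 0/1.
markers <- as.data.frame(matrix(rbinom(60, 1, 0.5), nrow = 6))

# Sum a range of columns (here 3 through 8) for each row.
markers$new <- rowSums(markers[, 3:8])

# Sort individuals so those with the most markers come first.
ranked <- markers[order(markers$new, decreasing = TRUE), ]

# apply() accepts any function; with sum it matches rowSums().
via_apply <- apply(markers[, 3:8], 1, sum)
```

Swapping `sum` for another function such as `max` or `mean` in the `apply` call gives a different per-row summary over the same columns.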

Greg Snow

I am not sure why I got this error: Error in rowSums(incomeData) : 'x' must be numeric – munmunbb Nov 19 '17 at 20:29
@munmunbb, you received that error because `incomeData` is not numeric. Use something like `str(incomeData)` to see what it is, then possibly convert it to a numeric matrix. – Greg Snow Nov 20 '17 at 18:06
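One way to handle that error, sketched here with a made-up `incomeData` (the column names `id`, `wages` and `bonus` are assumptions, not the commenter's real data), is to restrict the sum to the numeric columns:

```r
# Hypothetical data with one non-numeric column.
incomeData <- data.frame(id = c("a", "b", "c"),
                         wages = c(10, 20, 30),
                         bonus = c(1, 2, 3))

str(incomeData)  # shows that 'id' is character

# Keep only the numeric columns before summing.
num_cols <- sapply(incomeData, is.numeric)
incomeData$total <- rowSums(incomeData[, num_cols])
```

If a column only looks numeric (for example digits stored as text), it would need converting with something like `as.numeric` first; whether that is appropriate depends on the data.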

score 11 · Answer 3 · answered Jul 13 '18 at 00:30

I came here hoping to find a way to get the sum across all columns in a data table and ran into issues implementing the above solutions. One way to add a column with the sum across all columns uses the cbind function:

cbind(data, total = rowSums(data))

This method adds a total column to the data and avoids the alignment issue yielded when trying to sum across ALL columns using the above solutions (see the post below for a discussion of this issue).

Adding a new column to matrix error

See also [dplyr::mutate_all](https://dplyr.tidyverse.org/reference/summarise_all.html). — Paul Rougieux, Nov 27 '18 at 10:41

score 6 · Answer 4 · answered Mar 30 '22 at 00:00

Just for completeness, here are other methods not mentioned so far: different ways of doing the same thing using dplyr syntax, starting from a matrix:

mat = matrix(1:12, ncol = 3)

library(dplyr)

mat %>% as_tibble() %>% 
   mutate(sum = rowSums(across(where(is.numeric))))

# A tibble: 4 x 4
     V1    V2    V3   sum
  <int> <int> <int> <dbl>
1     1     5     9    15
2     2     6    10    18
3     3     7    11    21
4     4     8    12    24

Or with c_across:

mat %>% as_tibble() %>%
  rowwise() %>%
  mutate(sumrange = sum(c_across(everything()), na.rm = TRUE))

Or selecting specific columns by column name:

mat %>% as_tibble() %>%
  mutate(B1 = V1, B2 = V2) %>%
  rowwise() %>%
  mutate(sum_startswithB = sum(c_across(starts_with("B")), na.rm = TRUE))

     V1    V2    V3    B1    B2 sum_startswithB
  <int> <int> <int> <int> <int>           <int>
1     1     5     9     1     5               6
2     2     6    10     2     6               8
3     3     7    11     3     7              10
4     4     8    12     4     8              12

By column index, in this case the first through fourth columns:

mat %>% as_tibble() %>%
  mutate(B1 = V1, B2 = V2) %>%
  rowwise() %>%
  mutate(SumByIndex = sum(c_across(1:4), na.rm = TRUE))

     V1    V2    V3    B1    B2 SumByIndex
  <int> <int> <int> <int> <int>      <int>
1     1     5     9     1     5         16
2     2     6    10     2     6         20
3     3     7    11     3     7         24
4     4     8    12     4     8         28

Using a regular expression:

mat %>% as_tibble() %>%
  mutate( 'B1' = V1, B2 = V2) %>%
  mutate(sum_V = rowSums(.[grep("V[2-3]", names(.))], na.rm = TRUE),
  sum_B = rowSums(.[grep("B", names(.))], na.rm = TRUE))

     V1    V2    V3    B1    B2 sum_V sum_B
  <int> <int> <int> <int> <int> <dbl> <dbl>
1     1     5     9     1     5    14     6
2     2     6    10     2     6    16     8
3     3     7    11     3     7    18    10
4     4     8    12     4     8    20    12

Using apply is handier because you can pick any summary function (sum, mean, max, min, variance, standard deviation) to compute across the selected columns of each row.

mat %>% as_tibble() %>%
  mutate( 'B1' = V1, B2 = V2) %>%
  mutate(sum = select(., V1:B1) %>% apply(1, sum, na.rm=TRUE)) %>%
  mutate(mean = select(., V1:B1) %>% apply(1, mean, na.rm=TRUE)) %>%
  mutate(max = select(., V1:B1) %>% apply(1, max, na.rm=TRUE)) %>%
  mutate(min = select(., V1:B1) %>% apply(1, min, na.rm=TRUE)) %>%  
  mutate(var = select(., V1:B1) %>% apply(1, var, na.rm=TRUE)) %>%
  mutate(sd = select(., V1:B1) %>% apply(1, sd, na.rm=TRUE))

     V1    V2    V3    B1    B2   sum  mean   max   min   var    sd
  <int> <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <dbl>
1     1     5     9     1     5    16     4     9     1  14.7  3.83
2     2     6    10     2     6    20     5    10     2  14.7  3.83
3     3     7    11     3     7    24     6    11     3  14.7  3.83
4     4     8    12     4     8    28     7    12     4  14.7  3.83

Note: the identical var and sd values in every row are not an error. The data are generated linearly (1:12), so each row has the same spread. You can verify this by calculating the values for the first two rows:

> sd(c(1,5,9,1))
[1] 3.829708
> sd(c(2,6,10,2))
[1] 3.829708

You may consider updating this...as of dplyr [1.1.0](https://dplyr.tidyverse.org/news/index.html#dplyr-110) it is not recommended using `across` in this manner. They have introduced `pick` for tidy selecting and returning a tibble. — LMc, Aug 23 '23 at 17:17
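Following that comment, a version using `pick` might look like the sketch below. It assumes dplyr 1.1.0 or later and builds the tibble directly with the same values as the matrix example above; the names `dat` and `res` are hypothetical.

```r
library(dplyr)  # assumes dplyr >= 1.1.0, where pick() was introduced

# Same values as the matrix example above, built directly as a tibble.
dat <- tibble(V1 = 1:4, V2 = 5:8, V3 = 9:12)

# pick() returns the selected columns as a tibble, which rowSums() accepts.
res <- dat %>%
  mutate(sum = rowSums(pick(where(is.numeric))))
```

The result should match the `sum` column shown in the first dplyr example.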

Hamzah · Answer 5 · 2021-12-14T20:24:54.457

Here is an example comparing the elapsed time of the two methods:

mat = matrix(runif(4e6), ncol = 50)

Comparison between the apply function and rowSums:

library(microbenchmark)

apply_func <- function(x) {
    apply(x, 1, sum)
}

r_sum <- function(x) {
    rowSums(x)
}

# Compare the methods; times = 100 matches the neval column in the output below
microbenchmark(
    apply_func = apply_func(mat),
    r_sum = r_sum(mat), times = 100
)

------ output -- in milliseconds --------

       expr       min        lq      mean    median        uq      max neval
 apply_func 207.84661 260.34475 280.14621 279.18782 294.85119 354.1821   100
      r_sum  10.76534  11.53194  13.00324  12.72792  14.34045  16.9014   100

As you can see, the mean time for rowSums is about 21 times smaller than the mean time for apply. The difference in elapsed time may be even larger if the matrix has many more columns.
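Before comparing speed, it can be worth confirming that both approaches return the same numbers. A minimal check on a small made-up matrix (the seed and dimensions are arbitrary) could look like this:

```r
set.seed(42)  # arbitrary seed
small <- matrix(runif(200), ncol = 50)   # 4 x 50 toy matrix

# Both approaches should agree up to floating-point tolerance.
same <- isTRUE(all.equal(apply(small, 1, sum), rowSums(small)))

# The same family covers means and column-wise summaries.
row_means <- rowMeans(small)
col_sums  <- colSums(small)
```

`rowMeans`, `colSums` and `colMeans` belong to the same base R family as `rowSums` and are usually preferable to the equivalent `apply` call for the same reason.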

The main goal is the idea regardless of the dataset I am working on, what is applied to a small matrix will be usually applied to a large benchmark. — Hamzah, Dec 14 '21 at 19:48

score 1 · Answer 6 · answered Aug 07 '21 at 05:29

This could also help, however the best option is beyond any doubt the rowSums function:

data$new <- Reduce(function(x, y) {
  x + data[, y]
}, init = data[, 43], 44:167)
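Because a data frame is a list of columns, the same idea can be written more compactly by reducing over the columns themselves. This is a sketch on a made-up three-column data frame, not a replacement for the code above:

```r
# Hypothetical toy data frame.
df <- data.frame(a = 1:3, b = 4:6, c = 7:9)

# A data frame is a list of columns, so Reduce() can add them pairwise.
via_reduce <- Reduce(`+`, df[2:3])

# Reference result for comparison.
via_rowsums <- rowSums(df[2:3])
```

Note that `+` propagates `NA`, so unlike `rowSums(..., na.rm = TRUE)` this version has no built-in way to skip missing values.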

Anoushiravan R


score 1 · Answer 7 · edited Dec 03 '21 at 12:20

You can also use the adorn_totals function from the janitor package. It can total the columns or the rows depending on the value you give to the where argument.

Example:

library(magrittr)  # provides %>%

tibble::tibble(
  a = 10:20,
  b = 55:65,
  c = 2010:2020,
  d = LETTERS[1:11]) %>%
  janitor::adorn_totals(where = "col") %>%
  tibble::as_tibble()

Result (Total is b + c here; as the output shows, the first column, a, is left out of the totals):

# A tibble: 11 x 5
       a     b     c d     Total
   <int> <int> <int> <chr> <dbl>
 1    10    55  2010 A      2065
 2    11    56  2011 B      2067
 3    12    57  2012 C      2069
 4    13    58  2013 D      2071
 5    14    59  2014 E      2073
 6    15    60  2015 F      2075
 7    16    61  2016 G      2077
 8    17    62  2017 H      2079
 9    18    63  2018 I      2081
10    19    64  2019 J      2083
11    20    65  2020 K      2085
