1

I have a data frame made of 420 rows and 37 columns from insect field sampling data. I am trying to sum column values every 5 rows so that every 5 rows becomes just 1. In addition I am trying to make the non numerical values in my dataframe($site, $date, $plot) "collapse" so they fit the single new row made of the sums I mentioned.

I have tried using advice from this thread:

Summing columns on every nth row of a data frame in R

Including using package dplyr's summarize_each function as well as gl and colSums that are part of data.table.

I have tried these on my whole dataframe

library(data.table) setDT(FinalData)[, as.list(colSums(.SD)), by = gl(ceiling(420/5), 5, 420)]

this gives me:

Error in colSums(.SD) : 'x' must be numeric

and 

library(dplyr)
 FinalData %>%
   group_by(indx = gl(ceiling(420/5), 5, 420)) %>%
   summarise_each(funs(sum))

which gives me error: 
Error in Summary.factor(c(4L, 4L, 4L, 4L, 4L), na.rm = FALSE) : 
  ‘sum’ not meaningful for factors

  Site.Date.Plot  CarA  CarB    CarC...
1.SL.VI.1          0     0        1
2.SL.VI.1          0     0        0
3.SL.VI.1          0     6        0
4.SL.VI.1          0     0        3
5.SL.VI.1          1     0        0
...
every 5 rows has a different $Site.Date.Plot. I expect this:

  Site.Date.Plot  CarA  CarB    CarC...
1. SL.VI.1          1     6       4
2. SL.VI.2         ...   ...      ...

But I get the above error messages from above.
cjag
  • 11
  • 2
  • Hi, and welcome to SO! Could you provide a sample of data so we can see the situation? Take a look at [how to make a reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) – Calum You May 08 '19 at 00:09
  • Hi Calum, I provided a small portion of my data in my post, should I post more than that? Do you need my whole datasheet? – cjag May 08 '19 at 00:24
  • It's better to use ```dput``` so people can copy and paste. Also, including 20ish rows may have been better just to verify that every 5 rows there is a different Site.Date.Plot. So, ```dput(FinalData[1:20, 1:4])``` would have been great. – Cole May 08 '19 at 02:26

3 Answers3

1

Does this work?

FinalData %>%
   group_by(Site.Date.Plot) %>% 
   summarise_all(list(sum))

This will preserve the first column and sum the rest.

RDotC
  • 31
  • 4
0

You could try using mutate_at if you want to keep the first column Site.Date.Plot

library(dplyr)

FinalData %>%
  group_by(indx = gl(ceiling(420/5), 5, 420)) %>%
  mutate_at(2:37, sum, na.rm = TRUE) %>%
  slice(1)

Or it will also work with summarise_at but you'll loose the first column in this case

FinalData %>%
  group_by(indx = gl(ceiling(420/5), 5, 420)) %>%
  summarise_at(-1, sum, na.rm = TRUE)
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • mutate_at definitely worked. Is there a way I can get the output as a new dataframe and not just in the console? Many thanks. – cjag May 08 '19 at 00:56
  • just assign it to an object. `new_data <- FinalData %>% group_by(indx = gl(ceiling(420/5), 5, 420)) %>% mutate_at(2:37, sum, na.rm = TRUE) %>% slice(1)` – Ronak Shah May 08 '19 at 01:05
0

The first issue is that in your data.table and dplyr solutions, the sum function is operating on the factor of Site.Date.Plot. That's why you are receiving the error.

The second improvement is that your factor Site.Date.Plot changes every 5 rows. In other words, we can simply group by that instead of creating a different index.

These solutions are untested since there isn't enough sample data.

library(tidyverse)

 FinalData %>%
   group_by(Site.Date.Plot) %>%
   summarise_all(sum)

library(data.table)

setDT(FinalData)
FinalData[, lapply(.SD, sum), by = 'Site.Date.Plot']

#base R
aggregate(x = FinalData[, -1], by = list(FinalData$Site.Date.Plot), FUN = sum)
Cole
  • 11,130
  • 1
  • 9
  • 24