1

I'm trying to create a new dataset from an existing one. The new dataset is supposed to combine 60 rows from the original dataset in order to convert a sum of events occurring each second to the total by minute. The number of columns will generally not be known in advance.

For example, with this dataset, if we split it into groups of 3 rows:

d1
  a b c d
1 1 1 0 1
2 0 1 0 1
3 0 1 0 0
4 0 0 1 0
5 0 0 1 0
6 1 0 0 0

We'll get this data.frame. Row 1 contains the column sums for rows 1-3 of d1 and Row 2 contains the column sums for rows 4-6 of d1:

d2
  a b c d
1 1 3 0 2
2 1 0 2 0

I've tried d2<-colSums(d1[seq(1,NROW(d1),3),]) which is about as close as I've been able to get.

I've also considered recommendations from How to sum rows based on multiple conditions - R?,How to select every xth row from table,Remove last N rows in data frame with the arbitrary number of rows,sum two columns in R, and Merging multiple rows into single row. I'm all out of ideas. Any help would be greatly appreciated.

divibisan
  • 11,659
  • 11
  • 40
  • 58
Bar-Tzur
  • 85
  • 1
  • 10

2 Answers2

1

Overview

After reading Split up a dataframe by number of rows, I realized the only thing you need to know is how you'd like to split() d1.

In this case, you'd like to split d1 into multiple data frames based on every 3 rows. In this case, you use rep() to specify that you'd like each element in the sequence - 1:2 - to be repeated three times (the number of rows divided by the length of your sequence).

After that, the logic involves using map() to sum each column for each data frame created after d1 %>% split(). Here, summarize_all() is helpful since you don't need to know the column names ahead of time.

Once the calculations are complete, you use bind_rows() to stack all the observations back into one data frame.

# load necessary package ----
library(tidyverse)

# load necessary data ----
df1 <-
  read.table(text = "a b c d
1 1 0 1
0 1 0 1
0 1 0 0
0 0 1 0
0 0 1 0
1 0 0 0", header = TRUE)

# perform operations --------
df2 <-
  df1 %>%
  # split df1 into two data frames
  # based on three consecutive rows
  split(f = rep(1:2, each = nrow(.) / length(1:2))) %>%
  # for each data frame, apply the sum() function to all the columns
  map(.f = ~ .x %>% summarize_all(.funs = funs(sum))) %>%
  # collapse data frames together
  bind_rows()

# view results -----
df2
#   a b c d
# 1 1 3 0 2
# 2 1 0 2 0

# end of script #
Cristian E. Nuno
  • 2,822
  • 2
  • 19
  • 33
  • 1
    Thank you for your response, Cristian. An issue which I should have likely mentioned initially is that the data does not output into neat multiples. Continuing from the example above, we can say that the input dataset d1 may be 8 rows but need to be divided by 3. The last few rows can be discarded but given that both your and @RichPauloo 's answers presume that we know the count in advance. I was hoping that I would be able to coerce it to fit but that doesn't seem to be the case. I appreciate any further input you can provide and apologize for not being clearer at the outset. – Bar-Tzur Sep 13 '18 at 13:40
  • 1
    Please disregard. I was able to resolve this by taking a subset of the data in advance, equal to a multiple of 3 and use this to create a variable that could be used for `rep()` and `length()` using `n<-trunc((as.numeric(NROW(d1)))/3)` – Bar-Tzur Sep 13 '18 at 14:46
  • Happy to at least point you in the right direction! Glad you were able to resolve. – Cristian E. Nuno Sep 13 '18 at 14:54
1

Create a grouping variable, group_by that variable, then summarise_all.

# your data
d <- data.frame(a = c(1,0,0,0,0,1),
                b = c(1,1,1,0,0,0),
                c = c(0,0,0,1,1,1),
                d = c(1,1,0,0,0,0))

# create the grouping variable 
d$group <- rep(c("A","B"), each = 3)

# apply the mean to all columns
library(dplyr)
d %>% 
  group_by(group) %>% 
  summarise_all(funs(sum))

Returns:

# A tibble: 2 x 5
  group     a     b     c     d
  <chr> <dbl> <dbl> <dbl> <dbl>
1 A         1     3     0     2
2 B         1     0     3     0
Rich Pauloo
  • 7,734
  • 4
  • 37
  • 69