1

I am trying to simplify a large dataset (52k+ rows) by finding the maximum value for every two-week interval. I have already assigned week number values to every row and used the aggregate() function to find the maximum value for each week.

Simplified sample data:

week <- c(1:5, 5, 7:10)
conc <- rnorm(mean=50, sd=20, n=10)
df <- data.frame(week,conc)

aggregate(df, by=list(week), FUN=max)

However, I am stuck on how to further aggregate based on two-week intervals (e.g. weeks 1 & 2, weeks 3 & 4, ...). It's not as simple as combining every other row, since not every week was sampled exactly once.

I'm assuming there's a simple solution, I just haven't found it yet.

Thanks!

keta
  • Just showed this one on another SO query. Try `tapply(df$conc, rep(1:(nrow(df)/2), each=2), FUN = sum)` – Sagar Sep 08 '17 at 18:21
  • 1
    What should happen at week 5? Should the week 5 value just be returned? NA? – John Paul Sep 08 '17 at 18:21
  • Sagar: It still seems like it would throw off the pairings if there is a missing week. Unless I'm interpreting your code wrong... – keta Sep 08 '17 at 18:28
  • John Paul: Week 5 would simply be the maximum at week 5 since there is no Week 6 – keta Sep 08 '17 at 18:29
  • @keta - I ran it for the sample data set you provided. Didn't realize week 6 was missing. – Sagar Sep 08 '17 at 18:31
  • 1
    @keta Why don't you create another column called `bi_week` as `(df$week+1)%/%2` and `aggregate` based on that column? – Safwan Sep 08 '17 at 18:32
  • I suggest you read this post. Lots of good hacks in there. https://stackoverflow.com/questions/17389533/aggregate-values-of-15-minute-steps-to-values-of-hourly-steps – CCurtis Sep 08 '17 at 18:39
  • It is hard to tell if we all get the same answers due to the random data. If you could use `set.seed` or give set values for `conc`, that would make checking a little easier. – John Paul Sep 08 '17 at 18:46
  • 1
    @MusafirSafwan Clever! I wanted to do something like that but wasn't sure how to make the code work. Thanks! – keta Sep 08 '17 at 18:52
  • @JohnPaul Good tip. – keta Sep 08 '17 at 18:52

3 Answers

1
week <- c(1:5, 5, 7:10)

bi_week <- (week+1)%/%2

conc <- rnorm(mean=50, sd=20, n=10)

df <- data.frame(week,bi_week,conc)

aggregate(df, by=list(bi_week), FUN=max)
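A quick check of what the integer-division grouping produces (not part of the original answer, just the mapping it yields):

```r
# integer division pairs consecutive weeks into the same group
week <- 1:10
(week + 1) %/% 2
# 1 1 2 2 3 3 4 4 5 5
```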
Safwan
0
library(purrr)
library(dplyr)

Odds <- seq(1, max(week), by = 2)
Evens <- seq(2, max(week), by = 2)

map2(.x=Odds,.y=Evens, .f=function(x,y) {df %>% 
filter(week==x | week==y) %>% select(conc) %>% max})

I first made vectors of the odd and even week numbers. Then, using the purrr package, I fed these pairwise (1 & 2, then 3 & 4, etc.) into a function that uses the dplyr package to filter to just those weeks, select the `conc` values, and take the max.

Here is the output:

> map2(.x=Odds,.y=Evens, .f=function(x,y) {df %>% filter(week==x | week==y) %>% select(conc) %>% max})
[[1]]
[1] 68.38759

[[2]]
[1] 56.9231

[[3]]
[1] 77.23965

[[4]]
[1] 49.39443

[[5]]
[1] 49.38465

Note: you could use `map2_dbl` in place of `map2` and get a numeric vector instead.
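For instance, a sketch of the `map2_dbl` variant, assuming the same `df` and the odd/even week vectors from above (the `set.seed` call is hypothetical, added here only so the run is reproducible):

```r
library(purrr)
library(dplyr)

set.seed(42)  # hypothetical seed, only for reproducibility
df <- data.frame(week = c(1:5, 5, 7:10),
                 conc = rnorm(mean = 50, sd = 20, n = 10))

Odds  <- seq(1, max(df$week), by = 2)
Evens <- seq(2, max(df$week), by = 2)

# map2_dbl returns a plain numeric vector instead of a list
res <- map2_dbl(.x = Odds, .y = Evens,
                .f = function(x, y) df %>%
                  filter(week == x | week == y) %>%
                  pull(conc) %>%
                  max())
res
```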

Edit: removed the part about df2 as that was an error.

John Paul
0

Use pracma::ceil to grab each bi-weekly pair

library(pracma)
aggregate(df, by=list(ceil(df$week/2)), FUN=max)

Output

  Group.1 week     conc
1       1    2 76.09191
2       2    4 50.20154
3       3    5 54.93041
4       4    8 69.17820
5       5   10 74.67518

ceil(df$week/2)
# 1 1 2 2 3 3 4 4 5 5
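If you'd rather avoid the extra dependency, base R's `ceiling()` gives the same grouping here (an equivalent alternative, not what this answer used):

```r
# base-R equivalent of pracma::ceil for this grouping
week <- c(1:5, 5, 7:10)
ceiling(week / 2)
# 1 1 2 2 3 3 4 4 5 5
```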
CPak