Match and summarize dataframe in R

Question

I have a data frame with one column per year of time-series data, plus a start year and an end year for each row.

df = data.frame(y2000=c(12,636),y2001=c(234, 76),y2002=c(3434, 46),y2003=c(36,35),y2004=c(6, 64),   y2005=c(56,65), y2006=c(43,65), y2007=c( 6, 56),y2008=c( 64, 66),y2009=c(63, 5656),y2010 = c(65,54),startyear= c(2006, 2001), endyear= c(2009, 2005))

For each row I want to calculate the mean of the values from the start year to the end year (inclusive), and also the mean of the values before the start year and the mean of the values after the end year. The desired output looks like:

y2000 y2001 y2002 y2003 y2004 y2005 y2006 y2007 y2008 y2009 y2010 startyear endyear before_mean within_mean after_mean
   12   234  3434    36     6    56    43     6    64    63    65      2006    2009 629.6666667          44         65
  636    76    46    35    64    65    65    56    66  5656    54      2001    2005         636        57.2     1179.4

I have tried different match and index techniques but can't wrap my mind around this one.

akrun · Accepted Answer · 2015-03-21T20:03:42.627

1. dplyr/tidyr

It may be easier to convert the data from 'wide' to 'long' format first and then compute the means with dplyr/tidyr. The steps are: create a row identifier column ('ind'); reshape the data to 'long' using gather; split the 'variable' column into a prefix ('var1') and a numeric year ('var2') with extract; group by 'ind'; and take the mean of the 'value' column for each of three logical subsets (i.e. var2 < startyear, var2 >= startyear & var2 <= endyear, and var2 > endyear).

library(dplyr)
library(tidyr)

dS <- df %>%
    mutate(ind = row_number()) %>%
    gather(variable, value, starts_with('y')) %>%
    extract(variable, c('var1', 'var2'), '([^0-9]+)([0-9]+)',
            convert = TRUE) %>%
    group_by(ind) %>%
    summarise(before_mean = mean(value[var2 < startyear]),
              within_mean = mean(value[var2 >= startyear &
                                       var2 <= endyear]),
              after_mean = mean(value[var2 > endyear])) %>%
    as.data.frame()

nm1 <-  paste(c('before', 'within', 'after'), 'mean', sep="_")
dS
#   ind before_mean within_mean after_mean
#1   1    629.6667        44.0       65.0
#2   2    636.0000        57.2     1179.4

We can create additional columns in 'df' from the above output. Select only the three mean columns, so that the 'ind' column is not assigned by mistake:

df[nm1] <- dS[nm1]
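
As a side note, gather and extract have since been superseded in tidyr, so the same reshaping can also be sketched with pivot_longer. This is only an illustration, assuming tidyr 1.1.0 or later; the names 'df_example', 'means_long' and 'year' are made up for this example and are not part of the answer above.

```r
library(dplyr)
library(tidyr)

# Copy of the question's data under a separate (hypothetical) name
df_example <- data.frame(
  y2000 = c(12, 636),  y2001 = c(234, 76), y2002 = c(3434, 46),
  y2003 = c(36, 35),   y2004 = c(6, 64),   y2005 = c(56, 65),
  y2006 = c(43, 65),   y2007 = c(6, 56),   y2008 = c(64, 66),
  y2009 = c(63, 5656), y2010 = c(65, 54),
  startyear = c(2006, 2001), endyear = c(2009, 2005)
)

means_long <- df_example %>%
  mutate(ind = row_number()) %>%
  # One row per (ind, year); strip the 'y' prefix and store the year as integer
  pivot_longer(starts_with("y"), names_to = "year", names_prefix = "y",
               names_transform = list(year = as.integer)) %>%
  group_by(ind) %>%
  summarise(before_mean = mean(value[year < startyear]),
            within_mean = mean(value[year >= startyear & year <= endyear]),
            after_mean  = mean(value[year > endyear])) %>%
  as.data.frame()

means_long
```

The summarise step is unchanged from the gather version; only the reshaping differs.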

2. base R

We can also use base R methods, without changing the format of the dataset. From the original dataset ('df'), make an index ('indx') of the columns whose names contain digits, then strip the non-numeric part of those names and convert the rest to numeric ('v1').

 indx <- grep('\\d+', names(df))
 v1 <- as.numeric(sub('[^0-9]+', '', names(df)[indx]))

Loop over the rows of 'df' (lapply). For each row, match 'startyear' against 'v1' to get the column index 'i1', take the columns before it, unlist, and calculate the mean ('before_mean'). In the same way, match 'endyear' against 'v1' to get the index 'i2'. Based on 'i1' and 'i2', calculate 'within_mean' and 'after_mean'. Finally, rbind the list elements and assign the output to the new columns ('nm1') in 'df'.

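# Assumes the year columns come first in 'df', and that no period starts
# at the first year or ends at the last year
df[nm1] <- do.call(rbind, lapply(1:nrow(df), function(i) {
    i1 <- match(df$startyear[i], v1)   # column position of the start year
    i2 <- match(df$endyear[i], v1)     # column position of the end year
    before_mean <- mean(unlist(df[i, 1:(i1-1), drop=FALSE]))
    within_mean <- mean(unlist(df[i, i1:i2]))
    after_mean <- mean(unlist(df[i, (i2+1):length(v1)]))
    data.frame(before_mean, within_mean, after_mean) }))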
 df[nm1]
 #    before_mean within_mean after_mean
 #1    629.6667        44.0       65.0
 #2    636.0000        57.2     1179.4
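
If the row-wise loop becomes slow on a larger dataset, the same means can be sketched without a loop by building logical masks with outer and using rowSums. This is only a sketch under some assumptions: the year columns are named 'y' followed by digits, there are no NA values, and the names 'df_example', 'masked_mean' and 'res' are made up for this illustration.

```r
# Copy of the question's data under a separate (hypothetical) name
df_example <- data.frame(
  y2000 = c(12, 636),  y2001 = c(234, 76), y2002 = c(3434, 46),
  y2003 = c(36, 35),   y2004 = c(6, 64),   y2005 = c(56, 65),
  y2006 = c(43, 65),   y2007 = c(6, 56),   y2008 = c(64, 66),
  y2009 = c(63, 5656), y2010 = c(65, 54),
  startyear = c(2006, 2001), endyear = c(2009, 2005)
)

year_cols <- grep("^y[0-9]+$", names(df_example))
years <- as.numeric(sub("^y", "", names(df_example)[year_cols]))
vals <- as.matrix(df_example[year_cols])

# Logical masks with one row per data row and one column per year
before <- outer(df_example$startyear, years, ">")  # TRUE where year < startyear
after  <- outer(df_example$endyear, years, "<")    # TRUE where year > endyear
within <- !before & !after                         # TRUE inside the period

# Mean of the masked cells in each row (NaN if the mask selects nothing)
masked_mean <- function(mask) unname(rowSums(vals * mask) / rowSums(mask))

res <- data.frame(before_mean = masked_mean(before),
                  within_mean = masked_mean(within),
                  after_mean  = masked_mean(after))
res
```

Because the masks are built from comparisons rather than positions, this variant does not depend on where the start or end year falls among the columns.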

Dominic Comtois · Answer 2 · 2015-03-21T20:16:17.700

Here is another approach, also using base R. We'll create an intermediate vector that holds the years as numbers, in the same order as the column names (the non-year column names become NA, hence suppressWarnings). It is used to select columns of the actual data frame:

col.years <- suppressWarnings(as.numeric(sub("^y", "", colnames(df))))[1:11]

# Initialise everything to NA (better when preparing to loop over df)        
df$before_mean <- NA
df$within_mean <- NA
df$after_mean <- NA

for(i in seq_len(nrow(df))) {
    df$before_mean[i] <- mean(as.numeric(df[i, which(col.years < df$startyear[i])]))
    df$within_mean[i] <- mean(as.numeric(df[i, which((col.years >= df$startyear[i]) & (col.years <= df$endyear[i]))]))
    df$after_mean[i]  <- mean(as.numeric(df[i, which(col.years > df$endyear[i])]))
}

Results

df[,14:16]

#   before_mean within_mean after_mean
# 1    629.6667        44.0       65.0
# 2    636.0000        57.2     1179.4
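
One edge case worth keeping in mind with this kind of subsetting: if a row's period starts at the first available year (or ends at the last one), the "before" (or "after") selection is empty, and mean() of an empty numeric vector is NaN. A minimal sketch with a hypothetical row; the names 'col_years', 'values', 'start', 'before' and 'before_mean' are made up for this illustration:

```r
col_years <- 2000:2010
values <- c(12, 234, 3434, 36, 6, 56, 43, 6, 64, 63, 65)

# Hypothetical row whose period starts at the first available year
start <- 2000

before <- values[col_years < start]  # nothing is selected
mean(before)                         # NaN

# One possible guard: return NA when there is nothing to average
before_mean <- if (length(before) > 0) mean(before) else NA_real_
before_mean
```

Whether NaN or NA is the better result for an empty period depends on how the output is used later.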

score 1 · Answer 3 · answered Mar 21 '15 at 20:07

Here is a solution:

#The original data:
df = data.frame(y2000=c(12,636),y2001=c(234, 76),y2002=c(3434, 46),y2003=c(36,35),y2004=c(6, 64),   y2005=c(56,65), y2006=c(43,65), y2007=c( 6, 56),y2008=c( 64, 66),y2009=c(63, 5656),y2010 = c(65,54),startyear= c(2006, 2001), endyear= c(2009, 2005))

df$s = df$startyear - 1999
df$e = df$endyear - 1999
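# x[14] and x[15] hold the helper columns 's' and 'e'
# (column positions of the start and end years)
df$before_mean <- apply(df, 1, function(x) mean(x[1:(x[14]-1)]))
df$within_mean <- apply(df, 1, function(x) mean(x[x[14]:x[15]]))
df$after_mean <- apply(df, 1, function(x) mean(x[(x[15]+1):11]))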
df$s <- NULL
df$e <- NULL

This solution is tied to the exact years as in the example, but it would not be too hard to make it more generic.
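
One way the more generic version might look is to derive the column positions from the column names instead of hard-coding the 1999 offset and the 11 year columns. This is only a sketch, assuming the year columns are named 'y' followed by digits and that every start and end year has a matching column; the names 'df_example', 'year_cols', 'years', 's', 'e', 'vals', 'n' and 'rows' are made up for this illustration.

```r
# Copy of the question's data under a separate (hypothetical) name
df_example <- data.frame(
  y2000 = c(12, 636),  y2001 = c(234, 76), y2002 = c(3434, 46),
  y2003 = c(36, 35),   y2004 = c(6, 64),   y2005 = c(56, 65),
  y2006 = c(43, 65),   y2007 = c(6, 56),   y2008 = c(64, 66),
  y2009 = c(63, 5656), y2010 = c(65, 54),
  startyear = c(2006, 2001), endyear = c(2009, 2005)
)

year_cols <- grep("^y[0-9]+$", names(df_example))
years <- as.numeric(sub("^y", "", names(df_example)[year_cols]))

# Positions of each row's start and end year among the year columns
s <- match(df_example$startyear, years)
e <- match(df_example$endyear, years)

vals <- as.matrix(df_example[year_cols])
n <- length(years)
rows <- seq_len(nrow(vals))

# seq_len() yields an empty index (and so a NaN mean) when a period
# touches the first or last year, instead of indexing backwards
df_example$before_mean <- sapply(rows, function(i) mean(vals[i, seq_len(s[i] - 1)]))
df_example$within_mean <- sapply(rows, function(i) mean(vals[i, s[i]:e[i]]))
df_example$after_mean  <- sapply(rows, function(i) mean(vals[i, e[i] + seq_len(n - e[i])]))

df_example[c("before_mean", "within_mean", "after_mean")]
```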
