R: Building a loop function to aggregate variables based on variablename /index

Question

my data has the following structure. The datset is called "w12". This is only an excerpt of the data.

ID   ###   w1_panas1   ###    w1_panas2  ###  w1_panas4

1    ########       5       ##########         3      ###########         2

2     ########        4         ##########       3     ###########         1

3       ########      3          ##########      2         ###########     4

I build a new variable with the mean of the three items (w1_panas1 - w1_panas3) for each ID with the function:

w12$w**1**_panas_host <- apply (cbind(w12$w**1**_panas1, w12$w**1**_panas2, w12$w**1**_panas3), 1, mean, na.rm= "T")

The three items above are measured repeatedly over 12 weeks. For the other weeks the items are named

w**2**_panas1, w2_pana**2**, w2_panas3
w**3**_panas1, w3_panas**2**, w3_panas3
...

and the resulting variable should be named

w**2**_panas
w**3**_panas

I do not want to write the function above 12 times, but would like to use a loop function, that builds the variables automatically by only changing w1 to w2 to w3 to w4...

Can somebody help?

You _should_ have taken some time to create a reproducible example. See [here](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) how to do that for future questions. — talat, Dec 05 '14 at 14:17

score 0 · Answer 1 · answered Dec 05 '14 at 14:04

Your data should look like this (or you can quite easily get them in such a format):

set.seed(1)
df <- data.frame(id = rep(1:10, 12), panas1 = sample(1:500, 120),
                 panas2 = sample(1:300, 120), panas4 = sample(1:700, 120),
                 week = rep(1:12, each=10))

head(df)
    id panas1 panas2 panas4 week
1  1    133    298    215    1
2  2    186    149    405    1
3  3    286    145    636    1
4  4    452     52    100    1
5  5    101    224    289    1
6  6    445    134    147    1

library(reshape2); library(dplyr)

dfmeans <- melt(df, id.vars = c("id","week")) %>% 
group_by(id, week, variable) %>% summarise(avg = mean(value))

head(dfmeans)

  id week variable avg
1  1    1   panas1 133
2  1    1   panas2 298
3  1    1   panas4 215
4  1    2   panas1 496
5  1    2   panas2 167
6  1    2   panas4 526

The outcome is not perfectly in the format you want, but this is a fast and compact solution, that shows everything you want to know.

Thank you very much! Later on I need a hierchical data set anyways, so the format is acutally what I need. — Rebecca, Dec 05 '14 at 14:49

talat · Accepted Answer · 2014-12-05T14:42:41.610

First, some test data:

set.seed(123)

df <- data.frame(ID = rep(1:3, each = 3),
                 w1_panas1 = runif(9),
                 w1_panas2 = runif(9),
                 w1_panas3 = runif(9),
                 w2_panas1 = runif(9),
                 w2_panas2 = runif(9),
                 w2_panas3 = runif(9),
                 w3_panas1 = runif(9),
                 w3_panas2 = runif(9),
                 w3_panas3 = runif(9))
df
#  ID w1_panas1  w1_panas2 w1_panas3  w2_panas1 w2_panas2  w2_panas3  w3_panas1 w3_panas2    w3_panas3
#1  1 0.2875775 0.45661474 0.3279207 0.59414202 0.7584595 0.13880606 0.56094798 0.2743836 0.7101824014
#2  1 0.7883051 0.95683335 0.9545036 0.28915974 0.2164079 0.23303410 0.20653139 0.8146400 0.0006247733
#3  1 0.4089769 0.45333416 0.8895393 0.14711365 0.3181810 0.46596245 0.12753165 0.4485163 0.4753165741
#4  2 0.8830174 0.67757064 0.6928034 0.96302423 0.2316258 0.26597264 0.75330786 0.8100644 0.2201188852
#5  2 0.9404673 0.57263340 0.6405068 0.90229905 0.1428000 0.85782772 0.89504536 0.8123895 0.3798165377
#6  2 0.0455565 0.10292468 0.9942698 0.69070528 0.4145463 0.04583117 0.37446278 0.7943423 0.6127710033
#7  3 0.5281055 0.89982497 0.6557058 0.79546742 0.4137243 0.44220007 0.66511519 0.4398317 0.3517979092
#8  3 0.8924190 0.24608773 0.7085305 0.02461368 0.3688455 0.79892485 0.09484066 0.7544752 0.1111354243
#9  3 0.5514350 0.04205953 0.5440660 0.47779597 0.1524447 0.12189926 0.38396964 0.6292211 0.2436194727

Now the manipulations, using dplyr and tidyr:

library(dplyr)      # load the 
library(tidyr)      # required packages

df <- df %>% 
  group_by(ID) %>%                  # group the data by ID
  mutate(n = row_number()) %>%      # for each ID, create and index n
  ungroup()                         # ungroup the data

df %>% 
  gather(panas, value, -c(ID, n)) %>%                         # reshape the data to long format
  separate(panas, into = c("week", "number"), sep = "_") %>%  # split the column "panas" into two columns based on the "_" 
  group_by(ID, week, n) %>%                                   # group the data
  summarise(mean = mean(value)) %>%                           # calculate mean values for each group
  ungroup() %>%                                               # ungroup..
  spread(week, mean) %>%                                      # reshape from long to wide format
  left_join(df, ., by = c("ID", "n")) %>%                     # perform a join with the original data by ID and n so that all data is in one table
  select(-n)                                                  # drop column "n"

And the result is (notice the last 3 columns w1, w2, w3 which show the averages that you wanted):

#Source: local data frame [9 x 13]
#
#  ID w1_panas1  w1_panas2 w1_panas3  w2_panas1 w2_panas2  w2_panas3  w3_panas1 w3_panas2    w3_panas3        w1        w2        w3
#1  1 0.2875775 0.45661474 0.3279207 0.59414202 0.7584595 0.13880606 0.56094798 0.2743836 0.7101824014 0.3573710 0.4971359 0.5151713
#2  1 0.7883051 0.95683335 0.9545036 0.28915974 0.2164079 0.23303410 0.20653139 0.8146400 0.0006247733 0.8998807 0.2462006 0.3405987
#3  1 0.4089769 0.45333416 0.8895393 0.14711365 0.3181810 0.46596245 0.12753165 0.4485163 0.4753165741 0.5839501 0.3104190 0.3504549
#4  2 0.8830174 0.67757064 0.6928034 0.96302423 0.2316258 0.26597264 0.75330786 0.8100644 0.2201188852 0.7511305 0.4868742 0.5944970
#5  2 0.9404673 0.57263340 0.6405068 0.90229905 0.1428000 0.85782772 0.89504536 0.8123895 0.3798165377 0.7178692 0.6343089 0.6957505
#6  2 0.0455565 0.10292468 0.9942698 0.69070528 0.4145463 0.04583117 0.37446278 0.7943423 0.6127710033 0.3809170 0.3836943 0.5938587
#7  3 0.5281055 0.89982497 0.6557058 0.79546742 0.4137243 0.44220007 0.66511519 0.4398317 0.3517979092 0.6945454 0.5504639 0.4855816
#8  3 0.8924190 0.24608773 0.7085305 0.02461368 0.3688455 0.79892485 0.09484066 0.7544752 0.1111354243 0.6156791 0.3974613 0.3201504
#9  3 0.5514350 0.04205953 0.5440660 0.47779597 0.1524447 0.12189926 0.38396964 0.6292211 0.2436194727 0.3791869 0.2507133 0.4189367

Thank you very much! Next time I create a dataset like yours in my question. Just started with R. Can I ask an additonal question? What does the operator %>% do/mean? — Rebecca, Dec 05 '14 at 14:51
That is a "piping" operator. you can use it to create a series (pipe) of several operations after each other and pass on the data from one to the next. See http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html for more information. — talat, Dec 05 '14 at 14:58

R: Building a loop function to aggregate variables based on variablename /index

2 Answers2