
I am currently working on a large data set containing about 9,000 observations belonging to different groups. I would like to analyze this data using a split-sample design. Let me explain in detail what I would like to do. My data has the following structure:

GroupID  Performance   Commitment   Affect   Size
1234     5             4            2        2
1234     6             8            9        2
2235     4             3            2        5
2235     4             3            2        5
2235     2             1            7        5
2235     2             1            7        5
2235     2             6            10       5
3678     3             5            5        4
3678     7             3            5        4
3678     5             2            6        4
3678     1             4            6        4

Now, I would like to aggregate this data in the following way: for each group, I would like to combine the average performance score of the first half of the group with the average commitment and affect scores of the second half of the group to create one new observation. For groups of odd size, I would like to drop one observation within the group (e.g. the last one) to get an even group size. However, I would like to do this in two steps. First, the data should look like this:

GroupID  Performance   Commitment   Affect   Size
1234     5             8            9        2
2235     4             1            7        5
2235     4             1            7        5
3678     3             2            6        4
3678     7             4            6        4

In the next step, I would like to aggregate the data. The new data set would have one observation per group and look like this:

GroupID  Performance   Commitment   Affect   Size
1234     5             8            9        2
2235     4             1            7        5
3678     5             3            6        4

Again, please note that the last observation of group 2235 was dropped, since the group size was an odd number.

Is there any package out there that would split and aggregate my data in this way? If not, how would you go about coding this? I would be very grateful for any advice, since I currently have no idea how to approach this elegantly, other than writing a bunch of for loops.

Here is the code for the above example:

groupid <- c(1234, 1234, 2235, 2235, 2235, 2235, 2235, 3678, 3678, 3678, 3678)
performance <- c(5, 6, 4, 4, 2, 2, 2, 3, 7, 5, 1)
commitment <- c(4, 8, 3, 3, 1, 1, 6, 5, 3, 2, 4)
affect <- c(2, 9, 2, 2, 7, 7, 10, 5, 5, 6, 6)
size <- c(2, 2, 5, 5, 5, 5, 5, 4, 4, 4, 4)
mydata <- data.frame(groupid, performance, commitment, affect, size)

Many thanks!!

rp1
  • Have you tried anything on your own? `?aggregate` – dayne Aug 28 '13 at 14:41
  • I would start writing code in Matlab using a lot of 'for' loops, however, I thought there must surely be an easier way to do it in R. I have not used the aggregate function yet, but will look into it. Thanks! – rp1 Aug 28 '13 at 14:45
  • I would split the data first, then use `aggregate` on both subsets, then put it back together. – dayne Aug 28 '13 at 14:45
  • [This post](http://stackoverflow.com/questions/3505701/r-grouping-functions-sapply-vs-lapply-vs-apply-vs-tapply-vs-by-vs-aggrega) may be useful to get started with the various aggregating functions in R. – Henrik Aug 28 '13 at 14:50
  • Why do you need two steps? Also, you should be careful about your variable naming. `data` is an R function. – dayne Aug 28 '13 at 14:53
  • Could you please tell us what you will do if you have only one observation for one of the GroupIDs? – Metrics Aug 28 '13 at 15:02
  • If I only have one observation in one group, I will just initially drop it from my sample, i.e. there will not be any groups with one observation. I would need to do it in two steps because I will have to run a couple of tests after the first step and before the aggregation. – rp1 Aug 28 '13 at 15:05

2 Answers


Here is a solution using plyr:

library(plyr)
mydata1 <- ddply(mydata, .(groupid), summarize,
                 aveper = mean(head(performance, length(groupid) / 2)),
                 avecom = mean(tail(commitment, length(groupid) / 2)),
                 aveaff = mean(tail(affect, length(groupid) / 2)),
                 avesiz = mean(size))

> mydata1
  groupid aveper avecom aveaff avesiz
1    1234      5  8.000      9      2
2    2235      4  2.667      8      5
3    3678      5  3.000      6      4
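
Note that for odd-sized groups this overlaps the halves: group 2235 has 5 rows, and while `head(x, 5/2)` keeps 2 elements, `tail(x, 5/2)` keeps 3, which is why `avecom` for group 2235 comes out as 2.667 rather than the 1 shown in the question. A sketch (not the answer's original code) that first drops the last row of each odd-sized group, repeating the question's example data so the snippet runs on its own:

```r
library(plyr)

# the example data from the question
mydata <- data.frame(
  groupid     = c(1234, 1234, 2235, 2235, 2235, 2235, 2235, 3678, 3678, 3678, 3678),
  performance = c(5, 6, 4, 4, 2, 2, 2, 3, 7, 5, 1),
  commitment  = c(4, 8, 3, 3, 1, 1, 6, 5, 3, 2, 4),
  affect      = c(2, 9, 2, 2, 7, 7, 10, 5, 5, 6, 6),
  size        = c(2, 2, 5, 5, 5, 5, 5, 4, 4, 4, 4))

# drop the last row of any odd-sized group so both halves have exactly n/2 rows
evendata <- ddply(mydata, .(groupid), function(d) {
  if (nrow(d) %% 2 == 1) d[-nrow(d), ] else d
})

mydata1 <- ddply(evendata, .(groupid), summarize,
                 aveper = mean(head(performance, length(groupid) / 2)),
                 avecom = mean(tail(commitment, length(groupid) / 2)),
                 aveaff = mean(tail(affect, length(groupid) / 2)),
                 avesiz = mean(size))
# group 2235 now averages only its rows 3-4, giving avecom = 1 and aveaff = 7,
# matching the aggregated table in the question
```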

Update:

    mydata2 <- ddply(mydata, .(groupid), transform,
                     aveper = mean(head(performance, length(groupid) / 2)),
                     avecom = mean(tail(commitment, length(groupid) / 2)),
                     aveaff = mean(tail(affect, length(groupid) / 2)),
                     avesiz = mean(size), lengr = length(groupid))

    > mydata2
   groupid performance commitment affect size aveper avecom aveaff avesiz lengr
1     1234           5          4      2    2      5  8.000      9      2     2
2     1234           6          8      9    2      5  8.000      9      2     2
3     2235           4          3      2    5      4  2.667      8      5     5
4     2235           4          3      2    5      4  2.667      8      5     5
5     2235           2          1      7    5      4  2.667      8      5     5
6     2235           2          1      7    5      4  2.667      8      5     5
8     3678           3          5      5    4      5  3.000      6      4     4
9     3678           7          3      5    4      5  3.000      6      4     4
10    3678           5          2      6    4      5  3.000      6      4     4
11    3678           1          4      6    4      5  3.000      6      4     4

mydata2 <- mydata2[-7, ]  # this assumes that you have already taken care of uneven groups
mydata3 <- Map(function(x) head(mydata2[mydata2$groupid == x, ],
                                head(mydata2$lengr[which(mydata2$groupid == x)], 1) / 2),
               unique(mydata2$groupid))

library(plyr)
mydata4<-ldply(mydata3)

mydata5<-mydata4[,c(1,6:9)]
> mydata5
  groupid aveper avecom aveaff avesiz
1    1234      5  8.000      9      2
2    2235      4  2.667      8      5
3    2235      4  2.667      8      5
4    3678      5  3.000      6      4
5    3678      5  3.000      6      4
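
If the goal is the exact two-step layout from the question (a row-level intermediate data set first, aggregation second), a hedged alternative sketch is to pair each first-half row with the matching second-half row inside `ddply`. This is not the answer's original approach; it repeats the question's example data so it runs on its own:

```r
library(plyr)

# the example data from the question
mydata <- data.frame(
  groupid     = c(1234, 1234, 2235, 2235, 2235, 2235, 2235, 3678, 3678, 3678, 3678),
  performance = c(5, 6, 4, 4, 2, 2, 2, 3, 7, 5, 1),
  commitment  = c(4, 8, 3, 3, 1, 1, 6, 5, 3, 2, 4),
  affect      = c(2, 9, 2, 2, 7, 7, 10, 5, 5, 6, 6),
  size        = c(2, 2, 5, 5, 5, 5, 5, 4, 4, 4, 4))

# step 0: drop the last row of odd-sized groups
evendata <- ddply(mydata, .(groupid), function(d) {
  if (nrow(d) %% 2 == 1) d[-nrow(d), ] else d
})

# step 1: first half's performance paired with the second half's commitment/affect
step1 <- ddply(evendata, .(groupid), function(d) {
  h <- nrow(d) / 2
  data.frame(performance = d$performance[1:h],
             commitment  = d$commitment[(h + 1):(2 * h)],
             affect      = d$affect[(h + 1):(2 * h)],
             size        = d$size[1:h])
})

# step 2: aggregate to one observation per group
step2 <- ddply(step1, .(groupid), summarize,
               performance = mean(performance), commitment = mean(commitment),
               affect = mean(affect), size = mean(size))
```

`step1` reproduces the intermediate table from the question (e.g. two rows of `4 1 7 5` for group 2235), and `step2` the aggregated one, so tests can be run on `step1` before aggregating.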
Metrics
  • Perfect, thanks a lot! This is pretty much exactly what I needed (I can always figure out a way to drop the last observation of a group with an uneven number). However, is there a way to really do it in two steps as outlined above? I would need to run some tests on the data sample after step one and before the aggregation. Thanks again! – rp1 Aug 28 '13 at 15:03
  • Yes, I think its possible. Instead of summarize, use `transform` and it will get back to your original dataframe with these additional cols. You can then work on. – Metrics Aug 28 '13 at 15:07
  • Is there any way to get the data set into the exact format outlined above after step one - i.e. for group `2235` two observations with `4` in the performance column and `1`, `7` and `5` in the commitment, affect and size columns, respectively? – rp1 Aug 28 '13 at 15:16
  • There should be. You can use subset. I will update if I have a solution. – Metrics Aug 28 '13 at 15:20
  • Thanks! I have now coded it not very elegantly using for loops (see update in my original post), but it does the trick. – rp1 Aug 28 '13 at 16:22
  • Great. You should post that as an answer and accept your answer. – Metrics Aug 28 '13 at 16:28
  • See my update, which assumes that you have already taken uneven group sizes into account for step 1. You can test which one is more efficient and use it accordingly. – Metrics Aug 28 '13 at 17:01

I have now coded it the following way (rather brute force). Please let me know if you know a better way to get this done. In the end, I used the code provided by Metrics in one of the answers to aggregate my data (thanks again for that!):

ids <- unique(mydata$groupid)
pos <- 1  # index of the first row of the current group (assumes mydata is sorted by groupid)
for (i in seq_along(ids)) {
    total <- mydata[pos, 5]    # group size, read from the "size" column
    num <- floor(total / 2)    # size of each half; drops one row for odd sizes

    # overwrite the first half's commitment/affect/size with the second half's values
    for (m in pos:(pos + num - 1)) {
        mydata[m, -c(1, 2)] <- mydata[m + num, -c(1, 2)]
    }

    # mark the second half (and any leftover odd row) for deletion
    for (l in (pos + num):(pos + total - 1)) {
        mydata[l, ] <- NA
    }

    pos <- pos + total
}

mydata <- mydata[!is.na(mydata$groupid), ]

library(plyr)
mydata2 <- ddply(mydata, .(groupid), summarize,
                 aveper = mean(performance), avecomm = mean(commitment),
                 aveaff = mean(affect), avesiz = mean(size))
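
For what it's worth, the loops above could also be collapsed into a single grouped function, staying within the plyr idiom already used in this thread. This is only a sketch with the same semantics as the loop version (floor of size/2 per half, odd row dropped), repeating the question's example data so it runs on its own:

```r
library(plyr)

# the example data from the question
mydata <- data.frame(
  groupid     = c(1234, 1234, 2235, 2235, 2235, 2235, 2235, 3678, 3678, 3678, 3678),
  performance = c(5, 6, 4, 4, 2, 2, 2, 3, 7, 5, 1),
  commitment  = c(4, 8, 3, 3, 1, 1, 6, 5, 3, 2, 4),
  affect      = c(2, 9, 2, 2, 7, 7, 10, 5, 5, 6, 6),
  size        = c(2, 2, 5, 5, 5, 5, 5, 4, 4, 4, 4))

halved <- ddply(mydata, .(groupid), function(d) {
  h <- nrow(d) %/% 2                  # floor(size / 2), as in the loop version
  first <- d[1:h, ]                   # keep the first half (performance, size)
  first$commitment <- d$commitment[(h + 1):(2 * h)]  # overwrite with second half
  first$affect     <- d$affect[(h + 1):(2 * h)]
  first
})

mydata2 <- ddply(halved, .(groupid), summarize,
                 aveper = mean(performance), avecomm = mean(commitment),
                 aveaff = mean(affect), avesiz = mean(size))
```

`halved` is the intermediate data set from step one, so the tests planned between the two steps can be run on it before aggregating.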
rp1