Split Data Set into Group and then split those groups out by age in R

Question

I am trying to split my data set for analysis in R. I first want to split it by group (A or B) and then split each of those groups by age. I have tried using the split() function as follows:

Data <- read.csv("/users/SLA9DI/Documents/Test.csv")
split(Data,Data$Group)

But when I try split(Data,Data$Age), it splits by age only, and the same thing happens when I try split(Data$Group,Data$Age). The data will be used to compare groups of people who are the same age. I might also add gender later, so a further split by gender within those ages would be even more helpful. Example:

Group   Age   Data  Data2
A         13    15  10
A         13    14  6
A         18    13  2
A          8    13  8
A         12    2   2
A         14    2   2
A         16    3   2
A         16    4   4
A         16    23  5
A         16    15  4
B         13    5   5
B         13    56  6
B         18    6   1
B          8    76  6
B         12    7   3
B         14    8   2
B         16    9   2
B         16    10  5
B         16    11  6
B         16    12  7

Edit: I want to split the data into groups and then split each group by age, so I can compare the 16-year-olds in group B with the 16-year-olds in group A. Later I may want to split further by gender, for example to compare 16-year-old females in group A or B with 16-year-old males in group A or B.

It would be easier to answer if you provide a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) containing sample data and the exact desired result you wish to get for this sample data. — akhmed, May 06 '15 at 20:11
maybe `split(Data,interaction(Data$Group, Data$Age))` if you really want to but there are many functions and packages available that do the split/apply/combine thing better — rawr, May 06 '15 at 20:12
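A minimal sketch of the one-step split suggested in the comment above, using the sample table from the question. It passes a list of factors to split(), which combines them in the same way as interaction(); the object name by_group_age is only illustrative, and drop = TRUE is an added choice to discard empty Group/Age combinations.

```r
# Sample data copied from the question's table
Data <- data.frame(
  Group = rep(c("A", "B"), each = 10),
  Age   = rep(c(13, 13, 18, 8, 12, 14, 16, 16, 16, 16), times = 2),
  Data  = c(15, 14, 13, 13, 2, 2, 3, 4, 23, 15,
            5, 56, 6, 76, 7, 8, 9, 10, 11, 12),
  Data2 = c(10, 6, 2, 8, 2, 2, 2, 4, 5, 4,
            5, 6, 1, 6, 3, 2, 2, 5, 6, 7)
)

# Split on every Group/Age combination in one call;
# drop = TRUE discards combinations that have no rows
by_group_age <- split(Data, list(Data$Group, Data$Age), drop = TRUE)

names(by_group_age)      # element names have the form "Group.Age"
by_group_age[["A.16"]]   # 16-year-olds in group A
by_group_age[["B.16"]]   # 16-year-olds in group B
```

Each element is an ordinary data frame, so the two age-16 subsets can be compared directly.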

score 0 · Accepted Answer · answered May 06 '15 at 20:19

First off, rather than splitting the data multiple times, have you considered keeping the data together and using either by() or aggregate() with some functions to do your analysis?
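As a rough illustration of the "keep the data together" idea, the sketch below computes a per-group, per-age mean with aggregate() and with by(). The data frame dat holds a few rows taken from the question's table, and the choice of mean() as the summary is an assumption made only for the example.

```r
# A few rows from the question's table (ages 13 and 16 only)
dat <- data.frame(
  Group = c("A", "A", "A", "A", "B", "B", "B", "B"),
  Age   = c(13, 13, 16, 16, 13, 13, 16, 16),
  Data  = c(15, 14, 3, 4, 5, 56, 9, 10)
)

# One row per Group/Age combination, with the mean of the Data column
means <- aggregate(Data ~ Group + Age, data = dat, FUN = mean)
means

# by() applies a function to the subset of rows for each Group/Age pair
by(dat, list(Group = dat$Group, Age = dat$Age), function(d) mean(d$Data))
```

aggregate() returns a data frame, which is convenient for further processing, while by() prints one result per combination.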

Second, you simply need to apply the second split to each element of the first split's output. The simplest way to do this is to write a small function that wraps the built-in split() so that you can pass the name of the variable you want to split on, rather than a vector.

The following works:

options(stringsAsFactors = FALSE)
testdata <- data.frame(Age=c(10,11,9,10,13,12,11,9,10,8,13),
                       Group=c("A","B","A","C","D","A","A","A","C","B","C"),
                       Var1=c(3,4,1,3,3,1,7,3,1,7,4))

# Wrapper around split() that takes the name of the splitting column
func.split_wrapper <- function(dataframe, varname) {
  split(x = dataframe, f = dataframe[[varname]])
}

# First split by Age, then split each Age subset by Group
testdata.split1 <- func.split_wrapper(dataframe = testdata, varname = "Age")
testdata.split2 <- lapply(X = testdata.split1, FUN = func.split_wrapper, varname = "Group")

print(testdata.split2)

$`8`
$`8`$B
   Age Group Var1
10   8     B    7


$`9`
$`9`$A
  Age Group Var1
3   9     A    1
8   9     A    3


$`10`
$`10`$A
  Age Group Var1
1  10     A    3

$`10`$C
  Age Group Var1
4  10     C    3
9  10     C    1


$`11`
$`11`$A
  Age Group Var1
7  11     A    7

$`11`$B
  Age Group Var1
2  11     B    4


$`12`
$`12`$A
  Age Group Var1
6  12     A    1


$`13`
$`13`$C
   Age Group Var1
11  13     C    4

$`13`$D
  Age Group Var1
5  13     D    3
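To pull a single subset out of a nested list like the one printed above, index first by age and then by group. The sketch below rebuilds the nested split with an anonymous function instead of the wrapper; the name nested is illustrative only.

```r
# Same test data as in the answer above
testdata <- data.frame(Age = c(10, 11, 9, 10, 13, 12, 11, 9, 10, 8, 13),
                       Group = c("A", "B", "A", "C", "D", "A", "A", "A", "C", "B", "C"),
                       Var1 = c(3, 4, 1, 3, 3, 1, 7, 3, 1, 7, 4),
                       stringsAsFactors = FALSE)

# Outer list keyed by Age, inner lists keyed by Group
nested <- lapply(split(testdata, testdata$Age), function(d) split(d, d$Group))

# List names are character strings, so the age is quoted
nested[["10"]][["C"]]   # rows with Age 10 in Group C
names(nested[["10"]])   # groups that occur at Age 10
```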

I also realized I could have just used 'subset(data, Age == 18 & Group == "Normal")' and stored the result as an object instead of using 'split()' — Kunio, May 09 '15 at 17:13
It is, however, generally a bad practice to use the subset command in a non-interactive environment. A better way would be processing using by() or aggregate(). — TARehman, May 09 '15 at 17:15
Ah, I see that. I read further into it and people have been recommending the [ function instead of subset() — Kunio, May 11 '15 at 17:01
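For reference, a small sketch of the `[` form mentioned in the comments. The data frame and the group label "Normal" are made up for illustration, following the labels used in the comment above.

```r
# Hypothetical data; column names follow the comment above
dat <- data.frame(Age   = c(18, 18, 16, 18),
                  Group = c("Normal", "Other", "Normal", "Normal"),
                  Var1  = c(1, 2, 3, 4))

# Logical row index, with an empty column index to keep all columns
picked <- dat[dat$Age == 18 & dat$Group == "Normal", ]
picked
```

If the condition can evaluate to NA, wrapping it in which() avoids getting rows of NA back.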

score 0 · Answer 2 · answered May 06 '15 at 20:23

I might do it as follows. First, obtain the unique pairs of group and gender with expand.grid(). Then loop over the rows of that grid and subset the data for each pair.

set.seed(1237)
df <- data.frame(group = sample(c("A","B"), 10, replace = TRUE),
                 gender = sample(c("M","F"), 10, replace = TRUE),
                 age = sample(c(20:25), 10, replace = TRUE))

grid <- unique(expand.grid(df$group, df$gender))
names(grid) <- c("group", "gender")
grid

#group gender
#1      A      M
#2      B      M
#11     A      F
#12     B      F

lapply(1:nrow(grid), function(x) {
  df[df$group == grid[x, 1] & df$gender == grid[x, 2],]
})

[[1]]
group gender age
1     A      M  22
3     A      M  25
4     A      M  20
8     A      M  22

[[2]]
group gender age
6      B      M  24
9      B      M  25
10     B      M  22

[[3]]
group gender age
5     A      F  20

[[4]]
group gender age
2     B      F  24
7     B      F  25
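The same kind of subsets can probably be produced without building the grid by hand, by passing several factors to split(). This is a sketch on a small hand-written data frame (not the seeded sample above), and the object names are illustrative.

```r
# Small fixed example with the same columns as above
df <- data.frame(
  group  = c("A", "B", "A", "A", "A", "B"),
  gender = c("M", "F", "M", "M", "F", "M"),
  age    = c(22, 24, 25, 20, 20, 24)
)

# One data frame per group/gender pair; empty pairs are dropped
by_group_gender <- split(df, list(df$group, df$gender), drop = TRUE)
names(by_group_gender)
by_group_gender[["A.M"]]

# Adding age as a third factor gives group/gender/age subsets
by_all <- split(df, list(df$group, df$gender, df$age), drop = TRUE)
by_all[["A.M.22"]]
```

The element names join the factor levels with ".", which makes it easy to look up a specific combination.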
