
I'm aware that this question is simple, but I couldn't find a solution that avoids creating intermediate objects, and I want a one-liner, or at least something as simple as possible.

Suppose I have a data frame called df with columns x, y, z:

x<-c(rep('place1',33),rep('place2',33),rep('place3',34))
y<-sample(c('type1','type2','type3','type4','type5'),100,replace=T)
z<-sample(40:80,100,replace=T)
df<-data.frame(x,y,z)

I would like to get the subset of z for every combination of levels of x and y (type1 in place1, type2 in place1, type3 in place1...type4 in place3 and type5 in place3). Something like this:

[[place1]]
[type1]
[1] 57 73 74 47 52 61

[type2]
[1] 72 76 64 62 73 75
...

[type5]
...

[[place3]]
[type1]
...

[type5]

If this is possible, how could I access each subset?

I've tried a nested split inside an lapply, without success.

Sorry for such a simple question, but I couldn't find a suitable solution.

Any help would be appreciated.

JoseRamon

3 Answers


Here is one way. First split df by the variable x, then split each resulting data frame again by the variable y. This gives a nested list with one element per combination of x and y. A trimmed sample of the output is shown below each call.

lapply(split(df, f = df$x), function(x) split(x, f = x$y))

#$place1
#$place1$type1
#        x     y  z
#5  place1 type1 46
#7  place1 type1 41

#$place1$type2
#        x     y  z
#3  place1 type2 44
#4  place1 type2 59

If you just want the values for z, you can do something like this:

lapply(split(df, f = df$x), function(x) split(x$z, f = x$y))

#$place1
#$place1$type1
#[1] 46 41 50 59 54 51 66 70

#$place1$type2
#[1] 44 59 60 53 74 46 67 70

#$place1$type3
#[1] 63 70 80 44 73 74 58

#$place1$type4
#[1] 45 67 52 72 45 48 79 65

#$place1$type5
#[1] 75 54
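
On the "how could I access each subset" part of the question, here is a minimal sketch. It assumes the nested list is stored in a variable (called nested here, a name that is not in the original), and set.seed is added only so the example is reproducible.

```r
# Rebuild the example data; set.seed() is an addition for reproducibility
set.seed(1)
x <- c(rep("place1", 33), rep("place2", 33), rep("place3", 34))
y <- sample(c("type1", "type2", "type3", "type4", "type5"), 100, replace = TRUE)
z <- sample(40:80, 100, replace = TRUE)
df <- data.frame(x, y, z)

# Nested list: first level is x (place), second level is y (type)
nested <- lapply(split(df, f = df$x), function(d) split(d$z, f = d$y))

# Access one subset by name, with either $ or [[ ]]
nested$place1$type1
nested[["place1"]][["type1"]]

# Apply a function to every subset, here the mean of z per type within each place
means <- lapply(nested, function(place) sapply(place, mean))
means$place1
```

Access by name tends to be safer than access by position, because a type that never occurs in a place would shift the positions.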

EDIT

Seeing the link provided by @user295691, you could do the following as well.

split(df$z, interaction(df$x,df$y))
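
As a small sketch of how the flat version behaves, split also accepts a list of grouping vectors and combines them with interaction internally. The toy data frame below is hypothetical (not the OP's data) and is built so that one combination is empty, which shows what drop = TRUE changes.

```r
# Hypothetical toy data: place2 has no type2 observations
toy <- data.frame(
  x = c("place1", "place1", "place2", "place2", "place2"),
  y = c("type1", "type2", "type1", "type1", "type1"),
  z = c(57, 73, 74, 47, 52)
)

# Passing a list of grouping vectors is equivalent to using interaction()
all_combos <- split(toy$z, list(toy$x, toy$y))               # keeps the empty place2.type2
non_empty  <- split(toy$z, list(toy$x, toy$y), drop = TRUE)  # drops empty combinations

names(all_combos)
non_empty[["place2.type1"]]
```

Each element is named place.type, so a single subset can be pulled out with [[ ]] and the combined name.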

If you want each vector of z values as its own object in the global environment, you could do:

list2env(split(df$z, interaction(df$x,df$y)), .GlobalEnv)
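
If you would rather not fill the global environment with many small vectors, one possible variation is to send them to a separate environment. This is only a sketch: the toy data and the name env are illustrative and not part of the original answer.

```r
# Hypothetical toy data standing in for df
toy <- data.frame(
  x = c("place1", "place1", "place2", "place2", "place2"),
  y = c("type1", "type2", "type1", "type1", "type1"),
  z = c(57, 73, 74, 47, 52)
)

groups <- split(toy$z, interaction(toy$x, toy$y, drop = TRUE))

# Same idea as list2env(..., .GlobalEnv), but the vectors live in their own environment
env <- list2env(groups, envir = new.env())

ls(env)                            # names of the vectors that were created
get("place1.type1", envir = env)   # retrieve one vector by name
env[["place2.type1"]]              # environments also support [[ ]]
```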

EDIT2

The OP wanted to run statistics on this data, so I thought it would be a good idea to leave the following. If you need to create a data frame from a list of vectors of different lengths, you could do something like this. listvectors2df lets you create such a data frame, padding the shorter vectors with NA.

ana <- split(df$z, interaction(df$x,df$y))

# I used a good answer in this post and wrote the following.
#http://stackoverflow.com/questions/15201305/how-to-convert-a-list-consisting-of-vector-of-different-lengths-to-a-usable-data

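listvectors2df <- function(l){

    # Length of the longest vector in the list
    n.max <- max(sapply(l, length))
    # Indexing past the end of a shorter vector returns NA,
    # so every column ends up padded to n.max rows
    data.frame(sapply(l, "[", i = seq_len(n.max)), stringsAsFactors = FALSE)

}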

bob <- listvectors2df(ana)
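
For the pairwise comparisons the OP asks about in the comments below, one possible route skips the data frame and works on the list directly. This is a hedged sketch, not what the answer itself ran: the names groups, combos and result are illustrative, set.seed is added for reproducibility, and warnings about ties are suppressed.

```r
# Rebuild the example data; set.seed() is an addition for reproducibility
set.seed(42)
x <- c(rep("place1", 33), rep("place2", 33), rep("place3", 34))
y <- sample(c("type1", "type2", "type3", "type4", "type5"), 100, replace = TRUE)
z <- sample(40:80, 100, replace = TRUE)
df <- data.frame(x, y, z)

groups <- split(df$z, interaction(df$x, df$y), drop = TRUE)

# Every unordered pair of group names, one pair per column
combos <- combn(names(groups), 2)

# Unpaired Wilcoxon test for each pair; ties trigger warnings, silenced here
p_values <- apply(combos, 2, function(p) {
  suppressWarnings(wilcox.test(groups[[p[1]]], groups[[p[2]]])$p.value)
})

result <- data.frame(group1 = combos[1, ], group2 = combos[2, ], p.value = p_values)

# With this many tests, some correction for multiple comparisons is usually advisable
result$p.adjusted <- p.adjust(result$p.value, method = "holm")
head(result)
```

Base R's pairwise.wilcox.test(df$z, interaction(df$x, df$y)) is another option worth a look, since it performs the pairwise tests and the p-value adjustment in one call.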
jazzurro
  • Thank you so much for your answer. This is exactly what I wanted. But also, I would like to access the *z* column and put it as a vector in each list element, instead of all the columns in df. Is this possible? – JoseRamon Oct 30 '14 at 03:21
  • @JoseRamon Thank you very much for your comment. I wonder if `lapply(split(df, f = df$x), function(x) split(x$z, f = x$y))` does what you mentioned. If you want each vector as a separate object, you could do something like `list2env(split(df$z, interaction(df$x,df$y)), .GlobalEnv)`. Then, type `ls()`. You will see all vectors. Let me know if this is what you want. – jazzurro Oct 30 '14 at 03:35
  • I've just tested the `list2env` solution. Didn't know it was possible to create a vector with each list element. This is really useful for me, not only for this particular question, but for lots of other problems I've had. Let me ask you another question, now that we are in this step: could I run, for example a `wilcox.test` between all pairs possible (not paired, of course)? For example: place1.type1 with place1.type2...place5.type4 with place5.type5. Thanks – JoseRamon Oct 30 '14 at 04:11
  • @JoseRamon Let me think what I can try. I will get back to you tomorrow. – jazzurro Oct 30 '14 at 10:24
  • @JoseRamon I don't think I can write everything here. But, you first want to create a data frame. Check [this link](http://stackoverflow.com/questions/15201305/how-to-convert-a-list-consisting-of-vector-of-different-lengths-to-a-usable-data). You can learn how to create a df. I ran wilcox.test using `dplyr`. You cannot run all pairs. But you can run the test with one column against the rest. If mydf is your df, you could do something like: mydf %>% summarise_each(funs(wilcox.test(place1.type1,.)$p.value), vars=place2.type1:place3.type5). Due to ties, you will get warnings. Hope this helps you. – jazzurro Nov 01 '14 at 14:29
  • Sorry @jazzurro for posting this comment so late. I will try using `dplyr` and the function suggested, and will let you know. – JoseRamon Nov 06 '14 at 02:35
  • @JoseRamon That's all right. If you run my code, you just get p-values. But, give it a go and think if this is the way you wanna go or not. There are various ways to go around. :) – jazzurro Nov 06 '14 at 02:41
  • Ok, it works, and is enough for the results I want to achieve. Thanks. – JoseRamon Nov 12 '14 at 15:42
  • @JoseRamon Great to hear that. :) – jazzurro Nov 12 '14 at 16:07

You can also use split with interaction:

split(df, interaction(x,y))
$place1.type1
        x     y  z
6  place1 type1 57
25 place1 type1 55
27 place1 type1 55
28 place1 type1 75
29 place1 type1 54

$place2.type1
        x     y  z
36 place2 type1 70
42 place2 type1 69
45 place2 type1 78
57 place2 type1 79
59 place2 type1 46
60 place2 type1 45
63 place2 type1 73
64 place2 type1 79

$place3.type1
        x     y  z
85 place3 type1 54

To access each element:

> ll = split(df, interaction(x,y))
> 
> ll[[1]]
        x     y  z
6  place1 type1 57
25 place1 type1 55
27 place1 type1 55
28 place1 type1 75
29 place1 type1 54
> 
> ll[[2]]
        x     y  z
36 place2 type1 70
42 place2 type1 69
45 place2 type1 78
57 place2 type1 79
59 place2 type1 46
60 place2 type1 45
63 place2 type1 73
64 place2 type1 79

data.table can also be used:

library(data.table)
dtt = data.table(df)

dtt[order(x,y),list(meanz=mean(z), maxz=max(z), sumz=sum(z)),by=list(x,y)]
         x     y    meanz maxz sumz
 1: place1 type1 63.11111   80  568
 2: place1 type2 68.12500   79  545
 3: place1 type3 58.80000   76  294
 4: place1 type4 59.83333   79  359
 5: place1 type5 59.40000   80  297
 6: place2 type1 55.85714   69  391
 7: place2 type2 59.71429   71  418
 8: place2 type3 61.00000   76  305
 9: place2 type4 53.63636   71  590
10: place2 type5 44.66667   46  134
11: place3 type1 62.16667   74  373
12: place3 type2 63.42857   80  444
13: place3 type3 64.00000   77  384
14: place3 type4 61.28571   80  429
15: place3 type5 51.00000   60  408
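
If adding a package is not an option, a rough base R counterpart of this kind of per-group summary can be sketched with tapply and aggregate. The toy data frame below is hypothetical and only stands in for df.

```r
# Hypothetical toy data standing in for df
toy <- data.frame(
  x = c("place1", "place1", "place2", "place2", "place2"),
  y = c("type1", "type2", "type1", "type1", "type1"),
  z = c(57, 73, 74, 47, 52)
)

# Wide form: a place-by-type matrix of means, NA where a combination is empty
means <- tapply(toy$z, list(toy$x, toy$y), mean)
means

# Long form: one row per non-empty combination
long <- aggregate(z ~ x + y, data = toy, FUN = mean)
long
```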
rnso
  • Really interesting the solution from package `data.table` and the `interaction` argument of split. Thanks. – JoseRamon Oct 30 '14 at 03:42

There are a couple of solutions. The first is the lapply/split that jazzurro has provided. You could also combine the factors into a single factor, e.g.

> split(df, paste(df$x, df$y))
$`place1 type1`
        x     y  z
3  place1 type1 57
24 place1 type1 54

$`place1 type2`
        x     y  z
1  place1 type2 67
6  place1 type2 75
7  place1 type2 72
12 place1 type2 57
...

The other solution would be to use a library that has intrinsic support for multi-level grouping, like data.table or plyr/dplyr. In dplyr, the operation would look like this (including a summary, in this case the mean and max of the third column):

> df %>% group_by(x, y) %>% summarise(mean(z), max(z))
Source: local data frame [15 x 4]
Groups: x

        x     y  mean(z) max(z)
1  place1 type1 55.50000     57
2  place1 type2 65.50000     80
3  place1 type3 60.40000     78
4  place1 type4 57.12500     73
...
user295691