R subsetting a data frame into multiple data frames based on multiple column values

Question

I am trying to subset a data frame, where I get multiple data frames based on multiple column values. Here is my example

>df
  v1   v2   v3   v4   v5
   A    Z    1    10   12
   D    Y    10   12    8
   E    X    2    12   15
   A    Z    1    10   12
   E    X    2    14   16

The expected output is something like this where I am splitting this data frame into multiple data frames based on column v1 and v2

>df1
 v3   v4   v5
  1   10   12
  1   10   12
>df2
 v3   v4   v5
 10   12    8
>df3
 v3   v4   v5
 2    12   15
 2    14   16

I have written a code which is working right now but don't think that's the best way to do it. There must be a better way to do it. Assuming tab is the data.frame having the initial data. Here is my code:

v1Factors<-levels(factor(tab$v1))
v2Factors<-levels(factor(tab$v2))

for(i in 1:length(v1Factors)){
  for(j in 1:length(v2Factors)){
    subsetTab<-subset(tab, v1==v1Factors[i] & v2==v2Factors[j], select=c("v3", "v4", "v5"))
    print(subsetTab)
  }
}

Can someone suggest a better method to do the above?

Do you want to reuse those dataframes, or just print them grouped by those columns? — Thilo, Mar 13 '13 at 04:45
I want to reuse them.... want to plot graphs on those data frames. — Rachit Agrawal, Mar 13 '13 at 04:46

mnel · Accepted Answer · 2013-03-13T05:17:18.003

40

You are looking for split

split(df, with(df, interaction(v1,v2)), drop = TRUE)
$E.X
  v1 v2 v3 v4 v5
3  E  X  2 12 15
5  E  X  2 14 16

$D.Y
  v1 v2 v3 v4 v5
2  D  Y 10 12  8

$A.Z
  v1 v2 v3 v4 v5
1  A  Z  1 10 12

As noted in the comments

any of the following would work

library(microbenchmark)
microbenchmark(
                split(df, list(df$v1,df$v2), drop = TRUE), 
               split(df, interaction(df$v1,df$v2), drop = TRUE),
               split(df, with(df, interaction(v1,v2)), drop = TRUE))


Unit: microseconds
                                                  expr      min        lq    median       uq      max neval
            split(df, list(df$v1, df$v2), drop = TRUE) 1119.845 1129.3750 1145.8815 1182.119 3910.249   100
     split(df, interaction(df$v1, df$v2), drop = TRUE)  893.749  900.5720  909.8035  936.414 3617.038   100
 split(df, with(df, interaction(v1, v2)), drop = TRUE)  895.150  902.5705  909.8505  927.128 1399.284   100

It appears interaction is slightly faster (probably due the fact that the f = list(...) are just converted to an interaction within the function)

Edit

If you just want use the subset data.frames then I would suggest using data.table for ease of coding

library(data.table)

dt <- data.table(df)
dt[, plot(v4, v5), by = list(v1, v2)]

edited Mar 13 '13 at 05:17

answered Mar 13 '13 at 04:46

mnel

113,303
27
265
254

2

`split` can take a list for `f`, instead of having to use `interaction`. Not sure which is more efficient though. – A5C1D2H2I1M1N2O1R2T1 Mar 13 '13 at 04:48
Thanks for the benchmarks. In that case, @Arun's trick (`with(df, split(df, f = do.call(paste, df[1:2])))` would probably be even faster! And, that won't create unnecessary levels that need to be dropped. – A5C1D2H2I1M1N2O1R2T1 Mar 13 '13 at 04:58
@mnel Thanks for the prompt reply. Your suggestion definitely seems to work. But didn't exactly get the difference in the multiple methods you have suggested. – Rachit Agrawal Mar 13 '13 at 05:12
@mnel Looks like interaction creates all possible combination of the levels. But when I ran the above mentioned example, it didn't create all levels. Why so?? In the actual data set, it is creating.... It's baffling me. – Rachit Agrawal Mar 13 '13 at 05:36

score 8 · Answer 2 · answered Oct 24 '16 at 11:49

There's now also nest() from tidyr which is rather nice.

library(tidyr)
nestdf <- df %>% nest(v3:v5)
nestdf$data

> nestdf$data
[[1]]
# A tibble: 2 × 3
     v3    v4    v5
  <int> <int> <int>
1     1    10    12
2     1    10    12

[[2]]
# A tibble: 1 × 3
     v3    v4    v5
  <int> <int> <int>
1    10    12     8

[[3]]
# A tibble: 2 × 3
     v3    v4    v5
  <int> <int> <int>
1     2    12    15
2     2    14    16

Access individual tibbles with nestdf$data[1] and so on.

`dplyr::group_split()` is now another option, see the similar question and answer here: https://stackoverflow.com/a/57239963/10405322 — Stan Rhodes, Feb 23 '23 at 21:21

R subsetting a data frame into multiple data frames based on multiple column values

2 Answers2

Linked