
To apply a function to many smaller datasets drawn from a larger one, I need to split the large dataset by multiple variables. For further use of the child datasets, I want to store them in a nested list with the grouping variables as list node names (to be used with rapply).

An example:

head_mtcars <- head(mtcars, 10)

I know from here that I can split the data set using list(data$V1, data$V2), but the resulting list keeps both grouping variables on the same level. I would like list nodes such as $6$3, $8$3, etc.:

split(head_mtcars, list(head_mtcars$cyl, head_mtcars$gear), drop = TRUE)

$`6.3`
                mpg cyl disp  hp drat    wt  qsec vs am gear carb
Hornet 4 Drive 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Valiant        18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

$`8.3`
                   mpg cyl disp  hp drat   wt  qsec vs am gear carb
Hornet Sportabout 18.7   8  360 175 3.15 3.44 17.02  0  0    3    2
Duster 360        14.3   8  360 245 3.21 3.57 15.84  0  0    3    4

$`4.4`
            mpg cyl  disp hp drat   wt  qsec vs am gear carb
Datsun 710 22.8   4 108.0 93 3.85 2.32 18.61  1  1    4    1
Merc 240D  24.4   4 146.7 62 3.69 3.19 20.00  1  0    4    2
Merc 230   22.8   4 140.8 95 3.92 3.15 22.90  1  0    4    2

$`6.4`
               mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4     21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Merc 280      19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4

I also tried changing the separator, but this does not help:

## only changes the separator in the names to "$"; it does not create a new list level:
split(head_mtcars, list(head_mtcars$cyl, head_mtcars$gear), drop = TRUE, sep = "$")

$`6$3`
                mpg cyl disp  hp drat    wt  qsec vs am gear carb
Hornet 4 Drive 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Valiant        18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

$`8$3`
                   mpg cyl disp  hp drat   wt  qsec vs am gear carb
Hornet Sportabout 18.7   8  360 175 3.15 3.44 17.02  0  0    3    2
Duster 360        14.3   8  360 245 3.21 3.57 15.84  0  0    3    4

$`4$4`
            mpg cyl  disp hp drat   wt  qsec vs am gear carb
Datsun 710 22.8   4 108.0 93 3.85 2.32 18.61  1  1    4    1
Merc 240D  24.4   4 146.7 62 3.69 3.19 20.00  1  0    4    2
Merc 230   22.8   4 140.8 95 3.92 3.15 22.90  1  0    4    2

$`6$4`
               mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4     21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Merc 280      19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4

I also tried to modify the code from here for use with multiple splitting variables. It works perfectly with a single grouping variable, but with several it moves the grouping variables into dimnames, and I don't know how (or whether it is possible at all) to convert those into nested list levels.

by(head_mtcars, list(head_mtcars$cyl, head_mtcars$gear), identity, simplify = FALSE)

: 4
: 3
NULL
------------------------------------------------------------- 
: 6
: 3
                mpg cyl disp  hp drat    wt  qsec vs am gear carb
Hornet 4 Drive 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Valiant        18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
------------------------------------------------------------- 
: 8
: 3
                   mpg cyl disp  hp drat   wt  qsec vs am gear carb
Hornet Sportabout 18.7   8  360 175 3.15 3.44 17.02  0  0    3    2
Duster 360        14.3   8  360 245 3.21 3.57 15.84  0  0    3    4
------------------------------------------------------------- 
: 4
: 4
            mpg cyl  disp hp drat   wt  qsec vs am gear carb
Datsun 710 22.8   4 108.0 93 3.85 2.32 18.61  1  1    4    1
Merc 240D  24.4   4 146.7 62 3.69 3.19 20.00  1  0    4    2
Merc 230   22.8   4 140.8 95 3.92 3.15 22.90  1  0    4    2
------------------------------------------------------------- 
: 6
: 4
               mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4     21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Merc 280      19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
------------------------------------------------------------- 
: 8
: 4
NULL

I also tried various tidyverse approaches, but none of them solved the problem.

In the end, I would like to have a nested list with the levels from $cyl as the first level and the levels from $gear as the level below. Any advice?

Dom42
  • Split once, then iterate over the list splitting again - `lapply(split(head_mtcars, ~ cyl, drop = TRUE), split, ~ gear, drop = TRUE)`. – Ritchie Sacramento May 28 '22 at 00:57
  • Great idea! It works with the provided example. I then tried to apply it to my actual data, and it fails with this error message: `Error in unique.default(x, nmax = nmax) : unique() applies only to vectors`. I really don't understand why this is the case. – Dom42 May 28 '22 at 01:07
  • I made it work with my example using multiple intermediate steps and not using the `~` notation (this sometimes appears to be a problem: https://stackoverflow.com/questions/16681770/r-error-in-unique-defaultx-unique-applies-only-to-vectors). But as you solved my initial problem post, consider posting it as an answer, and I will accept it. – Dom42 May 28 '22 at 01:20
  • I found the cause of my error: my dataset was read in using `data.table::fread()` and was therefore a `data.table`. Apparently there is an open bug (or design decision) that makes this fail together with the `lapply(split)` approach: https://github.com/Rdatatable/data.table/issues/5392 Converting the data via `as.data.frame` fixed it, and your initial solution now works as well. – Dom42 May 28 '22 at 07:24
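
The approach from the comments above can be sketched in base R without the formula interface, which may help on older R versions. This is only a minimal sketch: the names `df`, `nested` and `counts` are made up for illustration, and the `as.data.frame()` call only matters if the input is a `data.table`, as in the case reported above.

```r
head_mtcars <- head(mtcars, 10)

# Only needed if the data is a data.table (e.g. read with data.table::fread());
# harmless for a plain data frame.
df <- as.data.frame(head_mtcars)

# First level: split by cyl. Second level: split each piece by gear.
nested <- lapply(
  split(df, df$cyl, drop = TRUE),
  function(d) split(d, d$gear, drop = TRUE)
)

names(nested)          # first-level names come from cyl
names(nested[["6"]])   # second-level names come from gear
nested[["6"]][["3"]]   # the data frame for cyl == 6 and gear == 3

# Illustration: apply a function to every child data frame (row count per group).
counts <- lapply(nested, function(g) lapply(g, nrow))
```

One thing to keep in mind when passing such a list to rapply(): data frames are themselves lists, so rapply() descends into their columns instead of stopping at the data frame level. A nested lapply() as above treats each data frame as one unit.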

1 Answer

Here is a method to nest the splits an arbitrary number of times using reduce() and map_depth().

Note that the formula interface for split() is a relatively recent feature (it was added in R 4.1.0), so if it doesn't work, you may have to upgrade to a more recent version of R.

library(purrr)

head_mtcars <- head(mtcars, 10)

fms <- list(~cyl, ~gear, ~carb)

reduce(.x = fms, .f = ~ map_depth(.x, .depth = vec_depth(.x) - 2,  split, .y), .init = head_mtcars)

$`4`
$`4`$`4`
$`4`$`4`$`1`
            mpg cyl disp hp drat   wt  qsec vs am gear carb
Datsun 710 22.8   4  108 93 3.85 2.32 18.61  1  1    4    1

$`4`$`4`$`2`
           mpg cyl  disp hp drat   wt qsec vs am gear carb
Merc 240D 24.4   4 146.7 62 3.69 3.19 20.0  1  0    4    2
Merc 230  22.8   4 140.8 95 3.92 3.15 22.9  1  0    4    2

$`6`
$`6`$`3`
$`6`$`3`$`1`
                mpg cyl disp  hp drat    wt  qsec vs am gear carb
Hornet 4 Drive 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Valiant        18.1   6  225 105 2.76 3.460 20.22  1  0    3    1


$`6`$`4`
$`6`$`4`$`4`
               mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4     21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Merc 280      19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4

$`8`
$`8`$`3`
$`8`$`3`$`2`
                   mpg cyl disp  hp drat   wt  qsec vs am gear carb
Hornet Sportabout 18.7   8  360 175 3.15 3.44 17.02  0  0    3    2

$`8`$`3`$`4`
            mpg cyl disp  hp drat   wt  qsec vs am gear carb
Duster 360 14.3   8  360 245 3.21 3.57 15.84  0  0    3    4
Ritchie Sacramento
  • Thank you for this solution! However, I actually like your other one from the comments even a bit more, because it doesn't rely on `purrr`. – Dom42 May 28 '22 at 07:28
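
For readers who also want arbitrary nesting depth without `purrr`, a small recursive helper is one possible way to do it. This is a sketch, not part of the answer above: `nested_split` is a hypothetical name, and it takes the grouping variables as column names instead of formulas.

```r
# Hypothetical helper: split `data` by each column named in `vars`,
# creating one list level per grouping variable, in the order given.
nested_split <- function(data, vars) {
  # No grouping variables left: this data frame is a leaf.
  if (length(vars) == 0L) {
    return(data)
  }
  # Split by the first variable, dropping empty groups ...
  pieces <- split(data, data[[vars[1]]], drop = TRUE)
  # ... then split every piece by the remaining variables.
  lapply(pieces, nested_split, vars = vars[-1])
}

res <- nested_split(head(mtcars, 10), c("cyl", "gear", "carb"))

names(res)                 # first level: cyl
names(res[["8"]][["3"]])   # third level under cyl == 8, gear == 3: carb
res[["8"]][["3"]][["4"]]   # the single-row data frame for Duster 360
```

The list structure should match the `reduce()` result shown in the answer, with `cyl` at the top level, then `gear`, then `carb`.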