Looping over dplyr/ Function using dplyr

Question

I have created a lot of dataframes like the one below:

df <- data %>% select(var1,var2,var3,var4) %>% group_by(var3,var4) %>% filter(var2 ==1) %>% summarise(var1 = mean(var1))

The output of each of these dataframes is the mean value of var1 after grouping by var3 and var4 and filtering according to different variables.

The only difference between the dataframe above and the rest in my code is the filtering variable.

Since I want one nice table to present my output, I then use left_join to merge and arrange the dataframes the way I want.

Although I have finished my analysis and got the output I wanted, I had to filter on many other variables and ended up creating 20 or so dataframes.

So my question is:

Is there any other way to create all these data frames at once using a function or a loop? Something like:

df[i]<- ....for i in 1-20..

Maybe I should define an array with the variables that I want to filter on and then name this array?

Any ideas more than welcome!

Thanks in advance.

score 0 · Accepted Answer · answered Sep 25 '17 at 18:15

Because it appears that your filters are not mutually exclusive (that is, a data point can be in more than one filtered group), I think that your best bet is likely to make a vector of your filters, then loop through that vector (though I would use lapply instead of a for loop).

Since you didn't provide a reproducible dataset or any idea of what filters you are using, I am going to use the built-in iris data and group by species only (the code will work the same for multiple grouping variables).

First, here is a set of filters:

irisFilters <-
  c(Long = quote(Sepal.Length > 6 | Petal.Length > 4)
    , Wide = quote(Sepal.Width > 3 | Petal.Width > 1.5)
    , Boxy = quote((Sepal.Width / Sepal.Length) > 0.5)
  )

Note that these are totally arbitrary (and likely not at all meaningful), but they should give you an idea of what is possible. Importantly, note that I am using quote so that I can later pass them into the filter step.
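As a minimal sketch of what quote is doing here (the names longFilter, viaQuote and direct are only for illustration): quote stores the expression without evaluating it, and !! inside filter splices it back in so that it is evaluated against the columns of the data.

```r
library(dplyr)

# quote() stores the expression itself; nothing is evaluated yet
longFilter <- quote(Sepal.Length > 6 | Petal.Length > 4)
class(longFilter)  # a language object ("call"), not a logical vector

# !! unquotes the stored expression inside filter()
viaQuote <- iris %>% filter(!! longFilter)

# The same condition written directly, for comparison
direct <- iris %>% filter(Sepal.Length > 6 | Petal.Length > 4)

# Both approaches should select the same rows
identical(viaQuote, direct)
```

Without quote, R would try to evaluate Sepal.Length > 6 right away, outside of any data frame, and fail because no such object exists.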

Then, use lapply to step through each filter criterion, using !! to tell dplyr to evaluate the expression stored in the variable rather than the variable name itself. Here, I am just taking the mean of Petal.Length, as that seems to match your use case:

library(dplyr)  # provides %>%, filter, group_by, summarise

irisSummaries <-
  irisFilters %>%
  lapply(function(thisFilter){
    iris %>%
      filter(!! thisFilter) %>%
      group_by(Species) %>%
      summarise(Petal.Length = mean(Petal.Length))
  })

This returns a list with the summarised result matching each of your conditions like this:

$Long
# A tibble: 2 x 2
     Species Petal.Length
      <fctr>        <dbl>
1 versicolor     4.502857
2  virginica     5.552000

$Wide
# A tibble: 3 x 2
     Species Petal.Length
      <fctr>        <dbl>
1     setosa     1.480952
2 versicolor     4.730000
3  virginica     5.572340

$Boxy
# A tibble: 3 x 2
     Species Petal.Length
      <fctr>        <dbl>
1     setosa     1.462000
2 versicolor     4.290909
3  virginica     5.320000

Then, you can combine them to a single table, using the name you assigned them (when creating the filter vector) as an identifier:

longSummaries <-
  irisSummaries %>%
  bind_rows(.id = "Filter")

Returns:

  Filter    Species Petal.Length
   <chr>     <fctr>        <dbl>
1   Long versicolor     4.502857
2   Long  virginica     5.552000
3   Wide     setosa     1.480952
4   Wide versicolor     4.730000
5   Wide  virginica     5.572340
6   Boxy     setosa     1.462000
7   Boxy versicolor     4.290909
8   Boxy  virginica     5.320000

And you can then use spread (from the tidyr package) to create a column for each filter instead:

wideSummaries <-
  longSummaries %>%
  spread(Filter, Petal.Length)

Returns:

     Species     Boxy     Long     Wide
*     <fctr>    <dbl>    <dbl>    <dbl>
1     setosa 1.462000       NA 1.480952
2 versicolor 4.290909 4.502857 4.730000
3  virginica 5.320000 5.552000 5.572340
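In tidyr 1.0.0 and later, spread is superseded by pivot_wider. Here is a rough sketch of the same reshaping step with the newer function, assuming such a tidyr version is installed. It rebuilds a two-filter version of the long table so that the block runs by itself.

```r
library(dplyr)
library(tidyr)

# Two of the filters from above, stored as unevaluated expressions
irisFilters <- list(
  Long = quote(Sepal.Length > 6 | Petal.Length > 4),
  Wide = quote(Sepal.Width > 3 | Petal.Width > 1.5)
)

# Long format: one row per filter and species
longSummaries <- irisFilters %>%
  lapply(function(thisFilter) {
    iris %>%
      filter(!! thisFilter) %>%
      group_by(Species) %>%
      summarise(Petal.Length = mean(Petal.Length))
  }) %>%
  bind_rows(.id = "Filter")

# Wide format: one column per filter, NA where a species has no matching rows
wideSummaries <- longSummaries %>%
  pivot_wider(names_from = Filter, values_from = Petal.Length)
```

One difference to be aware of: by default pivot_wider keeps the new columns in the order the filter names first appear, whereas spread sorts them alphabetically.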

The code should be robust to any number of filters, any names you choose, any number of grouping variables (or groups). A bit more care will be needed if you are returning multiple variables, though in that case a wide-format may be inadvisable anyway.
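To illustrate the point about multiple grouping variables, here is one way the steps above might be wrapped in a function. This is only a sketch: summariseByFilters, its arguments, and the mtcars filters are hypothetical names chosen for the example, and it uses sym and syms from rlang (installed alongside dplyr) to turn column names given as strings into something group_by and summarise can use.

```r
library(dplyr)

# Hypothetical helper: one grouped mean per named filter, stacked into one table
summariseByFilters <- function(data, filters, groupVars, valueVar) {
  filters %>%
    lapply(function(thisFilter) {
      data %>%
        filter(!! thisFilter) %>%
        group_by(!!! rlang::syms(groupVars)) %>%
        summarise(!! valueVar := mean(!! rlang::sym(valueVar))) %>%
        ungroup()
    }) %>%
    bind_rows(.id = "Filter")
}

# Arbitrary example filters on the built-in mtcars data
carFilters <- list(
  Heavy    = quote(wt > 3),
  Powerful = quote(hp > 150)
)

# Mean mpg for each filter, grouped by two variables
carSummaries <- summariseByFilters(mtcars, carFilters, c("cyl", "am"), "mpg")
```

The result has one row per filter and group combination, so it can be reshaped to wide format in the same way as the iris example.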

Sorry for my late response and in fact sorry for not providing more information in terms of the dataset. I am new to this world...However, allow me to thank you since your solution is brilliant and saves lots of useless lines of code. No more comments, as I literally reproduced the output that I wanted just by using your code without further adjustments. Thanks a lot. — Ioannis, Sep 28 '17 at 09:13
@Ioannis, I am glad it helped. Since you are new, these might be helpful: [How to make a great R reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) and [What should I do when someone answers my question?](https://stackoverflow.com/help/someone-answers) (voting and accepting answers is generally preferred to commenting; it is easier to quantify). — Mark Peterson, Sep 28 '17 at 11:31
