0

I have several dataframes that I want to merge. I'm looking for a scalable solution and I've found this nice one. So I do:

library(purrr)
library(dplyr)

df_a <- data.frame(id = 1:8)
df_b <- data.frame(id = 5:10)
df_c <- data.frame(id = 2:6)
df_d <- data.frame(id = 3:6)

dfs_to_merge <- list(df_a, df_b, df_c, df_d)

dfs_to_merge %>%
  reduce(left_join, by = "id")
#>   id
#> 1  1
#> 2  2
#> 3  3
#> 4  4
#> 5  5
#> 6  6
#> 7  7
#> 8  8

Created on 2021-01-25 by the reprex package (v0.3.0)

But what if, for example, I wanted to condition whether df_c will be included in dfs_to_merge based on a the value of a variable my_condition_df_c?

Example — Not a scalable solution

If my_condition_df_c > 5 then include df_c in dfs_to_merge

my_condition_df_c <- sample(1:10, 1)

if (my_condition_df_c > 5) {
  dfs_to_merge <- list(df_a, df_b, df_c, df_d)
} else {
  dfs_to_merge <- list(df_a, df_b, df_d)
}

dfs_to_merge %>%
  reduce(left_join, by = "id")

My Problem

Consider that I may have several dataframes to merge, and that each one of them may have its own condition that determines whether it should be passed forward for merging.

my_condition_df_a <- sample(1:100, 1) ## include df_a if my_condition_df_a > 65                      
my_condition_df_b <- sample(c("foo", "blah"), 1) ## include df_b if my_condition_df_b == "foo"       
my_condition_df_d <- sample(c(NA, 1, 2, 3, NA, 19), 1) ## include df_d if my_condition_df_d is not NA

How could I elegantly condition which data frame gets in and which is not? Using if-else blocks as I did above is not a scalable solution as it will easily become messy and unreadable code.


UPDATE — I made some progress


So what I do is to make a character vector of object names, containing the names of dataframes to be included in the list later (or not). Being included in this vector is subject to specific condition per data frame.

dfs_to_merge_names <- c()

if (my_condition_df_a > 65) {
  dfs_to_merge_names <- c(dfs_to_merge_names, "df_a")
} 

if (my_condition_df_b == "foo") {
  dfs_to_merge_names <- c(dfs_to_merge_names, "df_b")
} 

if (my_condition_df_c > 5) {
  dfs_to_merge_names <- c(dfs_to_merge_names, "df_c")
} 

if (!is.na(my_condition_df_d)) {
  dfs_to_merge_names <- c(dfs_to_merge_names, "df_d")
} 

mget(dfs_to_merge_names) %>% ## https://stackoverflow.com/a/45963957/6105259
  reduce(left_join, by = "id")

I will still be happy for ideas whether this code could be shortened and more concise.

Emman
  • 3,695
  • 2
  • 20
  • 44
  • There is nothing in the `dfs_to_merge` to suggest which of your sporadic `_condition_`s would be appropriate for which frame. That is, if `character` then compare with `"foo"`, but if `integer` then either compare with `65` or `!is.na`. You're going to have to devise a way to determine that side yourself. Once you have that, let's say all of that reduces to `my_conditions` which is a `logical` for each element of your list, then you can do `reduce(dfs_to_merge[my_conditions], left_join, by = "id")`. – r2evans Jan 25 '21 at 15:04
  • @r2evans, thanks for commenting. I'm not sure I understand your point though. I updated my question, maybe it brings me closer to what you've said? In any case, my update is in the spirit of the solution I'm hoping to find, if that's helpful in any way... – Emman Jan 25 '21 at 16:00

1 Answers1

2

Here's stab that avoids the use of mget.

conditions <- c(
  (my_condition_df_a > 65),
  (my_condition_df_b == "foo"),
  (my_condition_df_c > 5),
  (!is.na(my_condition_df_d))
)
reduce(dfs_to_merge[conditions], left_join, by = "id")

An alternative, if you are not assured of the order of conditions versus frames in the list. Build your list of frames named, either manually or with lst:

dfs_to_merge <- list(df_a=df_a, df_b=df_b, df_c=df_c, df_d=df_D) # names can be anything
dfs_to_merge <- lst(df_a, df_b, df_c, df_d)                      # names will be object's symbols

Then using your condition-building:

dfs_to_merge_names <- c()

if (my_condition_df_a > 65) {
  dfs_to_merge_names <- c(dfs_to_merge_names, "df_a")
}
# ...
reduce(dfs_to_merge[dfs_to_merge_names ], left_join, by = "id")
r2evans
  • 141,215
  • 6
  • 77
  • 149
  • Thank you. However, I'm a bit confused about the second approach. We first set up `dfs_to_merge <- lst(df_a, df_b, df_c, df_d)` and then build `dfs_to_merge_names` with `if` blocks, but at the end we don't use `dfs_to_merge_names` but rather `dfs_to_merge[conditions]`? – Emman Jan 25 '21 at 19:20
  • just a thought, would have it been simpler/more concise if we had first organized this information in a table, where one column is for *names* of dataframes, and another column is for the condition? This way, we have the mapping of which condition pertains to what dataframe, and we can iterate over this? I'm wary of code that relies on order because of the possible risk of mismatches. Thus, your first suggestion is something I'd typically avoid. Leaving us with your second suggestion, which isn't really more concise than my `mget` solution (we still use `if` blocks). – Emman Jan 26 '21 at 07:13
  • referring to my previous comment, I'm demonstrating such a table: `df_names_and_conditions <- tibble::tribble(~df_name, ~condition, "df_a", sample(1:100, 1) > 65, "df_b", sample(c("foo", "blah"), 1) == "foo", "df_c", sample(1:10, 1) > 5, "df_d", sample(c(NA, 1, 2, 3, NA, 19), 1)) ` – Emman Jan 26 '21 at 07:22