1

I'm trying to join two data sets in R using a left join in a function. Firstly I have my main data frame GE_GC referenced by the "df" in the function and I am trying to join a data frame called GE_GC_Teacher_Names however I would like the "GE_GC" part of the object name to be dynamic as I have multiple data sets with unique set of names that needs to be joined. For example if the "df" reference in my function was EX_EF then the function would join the EX_EF_Teacher_Names data frame onto the EX_EF data frame.


Q2_Table <- function(df){
  
  
  df %>% select(contains("Q2_")) %>% 
  gather(var,value) %>% 
  group_by(var) %>%
  summarise(
    Mean = round(mean(as.numeric(value), na.rm = TRUE), 2),
    Responses = length(value[!is.na(value)]),
    "Very Dissatisfied" = paste0(length(value[value == "1" & !is.na(value)]), " (", sprintf("%1.0f%%", length(value[value == "1" & !is.na(value)])/Responses*100), ")"),
    "Dissatisfied" = paste0(length(value[value == "2" & !is.na(value)]), " (", sprintf("%1.0f%%", length(value[value == "2" & !is.na(value)])/Responses*100), ")"),
    "Neutral" = paste0(length(value[value == "3" & !is.na(value)]), " (", sprintf("%1.0f%%", length(value[value == "3" & !is.na(value)])/Responses*100), ")"),
    "Satisfied" = paste0(length(value[value == "4" & !is.na(value)]), " (", sprintf("%1.0f%%", length(value[value == "4" & !is.na(value)])/Responses*100), ")"),
    "Very Satisfied" = paste0(length(value[value == "5" & !is.na(value)]), " (", sprintf("%1.0f%%", length(value[value == "5" & !is.na(value)])/Responses*100), ")")
  ) %>%
    left_join(
      as.name(paste0((deparse(substitute(df))),"_Teacher_Names")), #Here works with imputed df name but trying to dynamically name teacher df
      by = 'var'
    ) %>%
      rename("Teacher" = teacher) %>%
      select(-var, -value) %>%
      relocate(Teacher, .before = Mean)
}

Q2_Output <- Q2_Table(GE_GC)

When trying to run this function I get the following error even though a matching column called "var" is present in the GE_GC and GE_GC_Teacher_Names data frames.

Error in auto_copy(): ! x and y must share the same src. i set copy = TRUE (may be slow). Run rlang::last_error() to see where the error occurred. >

The following code works fine when I input the teacher data frame name manually

Q2_Table <- function(df){
  
  
  df %>% select(contains("Q2_")) %>% 
  gather(var,value) %>% 
  group_by(var) %>%
  summarise(
    Mean = round(mean(as.numeric(value), na.rm = TRUE), 2),
    Responses = length(value[!is.na(value)]),
    "Very Dissatisfied" = paste0(length(value[value == "1" & !is.na(value)]), " (", sprintf("%1.0f%%", length(value[value == "1" & !is.na(value)])/Responses*100), ")"),
    "Dissatisfied" = paste0(length(value[value == "2" & !is.na(value)]), " (", sprintf("%1.0f%%", length(value[value == "2" & !is.na(value)])/Responses*100), ")"),
    "Neutral" = paste0(length(value[value == "3" & !is.na(value)]), " (", sprintf("%1.0f%%", length(value[value == "3" & !is.na(value)])/Responses*100), ")"),
    "Satisfied" = paste0(length(value[value == "4" & !is.na(value)]), " (", sprintf("%1.0f%%", length(value[value == "4" & !is.na(value)])/Responses*100), ")"),
    "Very Satisfied" = paste0(length(value[value == "5" & !is.na(value)]), " (", sprintf("%1.0f%%", length(value[value == "5" & !is.na(value)])/Responses*100), ")")
  ) %>%
    left_join(
      GE_GC_Teacher_Names, #Here works with imputed df name but trying to dynamically name teacher df
      by = 'var'
    ) %>%
      rename("Teacher" = teacher) %>%
      select(-var, -value) %>%
      relocate(Teacher, .before = Mean)
}

Q2_Output <- Q2_Table(GE_GC)

So this section is the problem:

left_join(
      as.name(paste0((deparse(substitute(df))),"_Teacher_Names")), #Here works with imputed df name but trying to dynamically name teacher df
      by = 'var'
    )

Any help would be appreciated thank you.

Zeno_____
  • 13
  • 2
  • 2
    Welcome to SO, Zeno_____! (1) Since you're writing function to work programmatically with dplyr pipes, have you read https://dplyr.tidyverse.org/articles/programming.html? (2) `tidyr::gather` is superseded by the much-more-capable (and readable) `pivot_longer` (and `spread` by `pivot_wider`), the transition is not hard and it is worth it for its flexibility. (3) Sample data would be really useful here to see what's going on, please see https://stackoverflow.com/q/5963269, [mcve], and https://stackoverflow.com/tags/r/info for example uses of `dput(.)` (and other methods). – r2evans Jun 14 '22 at 12:52
  • Thanks for the reply and suggestions, I'll be sure to give pivot_longer a go in the future. I can't really share the data frame as it has identifiable information in it but they are both fairly simple. The first "GE_GC" is just a table with the rows corresponding to certain questions such as Q2_1, Q2_2, .... and the columns are var (which contains the question names/numbers) respondents, mean, counts of Strongly disagree - Strongly Agree. The GE_GC_Teacher_Names data frame has three columns one called var with the question names in it to match GE_GC and then a Teachers column with names – Zeno_____ Jun 14 '22 at 13:07
  • I understand, and ... _describing_ data usually doesn't work well. If it's simple enough, then perhaps you can make a fake frame that mimics your real frame in type/class, variability, factors/categorical variables, etc. It doesn't have to be real, just representative. You are much more likely to get fast and accurate assistance if the question is fully reproducible including sample data and expected output given that sample data. (If you generate random data -- a common practice -- then please use `set.seed` so that we all have the "same random".) – r2evans Jun 14 '22 at 13:19
  • Said differently, I understand completely the premise of sensitive data and asking for help on a public forum. It's certainly not trivial/instant. Please realize that you're asking us to _infer_ what the data looks like for the sake of testing your workflow, where problems we see may be the same problems you're having or completely unrelated problems because we mis-inferred the content. – r2evans Jun 14 '22 at 13:21
  • Lastly (for now), your function takes `df` as an argument, but it breaches scope by using `GE_GC_Teacher_Names`. The practice of using implicit variables in the calling environment (or search path) produces code that is non-reproducible, and because two calls to your function with exactly the same argument(s) can easily produce different results, it makes debugging very difficult and testing effectively impossible or unreliable (not to be trusted). Just some thoughts. – r2evans Jun 14 '22 at 13:24
  • Thanks a lot for all your help Even, I have tired to get a dput of my two data files which I will post below. Please let me know if you need anything else. – Zeno_____ Jun 14 '22 at 13:38
  • GE_GC = structure(list(Q2_1 = c("6", "7", "6", "6", "7", "7")), row.names = c(NA, -6L), class = "data.frame") Please note i only included a Q2 row to keep with character count but all rows are the same structure – Zeno_____ Jun 14 '22 at 13:40
  • GE_GC_Teacher_Names = structure(list(var = c("Q2_1", "Q2_2", "Q2_3", "Q2_4"), value = c("Example", "Example", "Example", "Example"), teacher = 1:4), row.names = c(NA, -4L), class = "data.frame") – Zeno_____ Jun 14 '22 at 13:42

1 Answers1

0

I suggest you simplify this a little by making the Teacher frame an argument to the function. This does two things:

  1. Simplifies your logic, where you are not relying so much on the name of an object (with the assumption of other existing variables); and
  2. Ensures the function is more functional, where its output is derived exclusively by the arguments passed to it, no inference, no guessing.
Q2_Table <- function(df, tchr) {
  df %>%
    select(contains("Q2_")) %>%
    pivot_longer(everything(), names_to = "var") %>%
    group_by(var) %>%
    summarise(
      Mean = round(mean(as.numeric(value), na.rm = TRUE), 2),
      Responses = length(value[!is.na(value)]),
      "Very Dissatisfied" = paste0(length(value[value == "1" & !is.na(value)]), " (", sprintf("%1.0f%%", length(value[value == "1" & !is.na(value)])/Responses*100), ")"),
      "Dissatisfied" = paste0(length(value[value == "2" & !is.na(value)]), " (", sprintf("%1.0f%%", length(value[value == "2" & !is.na(value)])/Responses*100), ")"),
      "Neutral" = paste0(length(value[value == "3" & !is.na(value)]), " (", sprintf("%1.0f%%", length(value[value == "3" & !is.na(value)])/Responses*100), ")"),
      "Satisfied" = paste0(length(value[value == "4" & !is.na(value)]), " (", sprintf("%1.0f%%", length(value[value == "4" & !is.na(value)])/Responses*100), ")"),
      "Very Satisfied" = paste0(length(value[value == "5" & !is.na(value)]), " (", sprintf("%1.0f%%", length(value[value == "5" & !is.na(value)])/Responses*100), ")")
    ) %>%
    left_join(tchr, by = 'var') %>%
    rename("Teacher" = teacher) %>%
    select(-var, -value) %>%
    relocate(Teacher, .before = Mean)
}

Q2_Table(GE_GC, GE_GC_Teacher_Names)
# # A tibble: 1 x 8
#   Teacher  Mean Responses `Very Dissatisfied` Dissatisfied Neutral Satisfied `Very Satisfied`
#     <int> <dbl>     <int> <chr>               <chr>        <chr>   <chr>     <chr>           
# 1       1   6.5         6 0 (0%)              0 (0%)       0 (0%)  0 (0%)    0 (0%)          

(Notice that I also shifted from gather to pivot_longer. While it adds nothing here, if you use it more and in more-complicated situations, using this newer function will pay off.)


Data

GE_GC <- structure(list(Q2_1 = c("6", "7", "6", "6", "7", "7")), row.names = c(NA, -6L), class = "data.frame")
GE_GC_Teacher_Names <- structure(list(var = c("Q2_1", "Q2_2", "Q2_3", "Q2_4"), value = c("Example", "Example", "Example", "Example"), teacher = 1:4), row.names = c(NA, -4L), class = "data.frame")
r2evans
  • 141,215
  • 6
  • 77
  • 149