2

I have a R project with lots of dataframes. Each dataframe has similar naming conventions.

{For more detail - each dataframe is from a different electronic form that my field staff fill in, created in ODK and exported from server into a csv file and then imported into an R dataframe. In each electronic form you can either scan an ID barcode or if the barcode won't scan you can manually type in an ID. These go into different columns in the dataframe I want to make a new column with a single ID, regardless of whether the ID came from the "IDscan" column or the "IDmanual" column.}

I want to make a new column in each data frame, based on values from other columns.

I can do this extremely easily for a single data frame using mutate and ifelse;

library(dplyr)

A1stform <- data.frame(a001_idscan = c(123,456,NA), a002_idmanual = c(NA,NA,789))
B2ndform <- data.frame(b001_idscan = c(NA,456,789), b002_idmanual = c(123,NA,NA))
C3rdform <- data.frame(c001_idscan = c(123,456,789), c002_idmanual = c(NA,NA,NA))

A1stform <- A1stform %>%
  mutate(a999_id = ifelse(!is.na(a001_idscan),a001_idscan,a002_idmanual)

But I get in all kinds of trouble when I try to make it into a loop or a function to automatically go through all the dataframes.

I think part the issue is that I can create character strings of the dataframe names and variable names, but I can’t seem to make them refer to the data frame or variable as they just go into the formula as a string.

I’ve read all the documentation about programming in dplyr at vignette("programming", "dplyr”) and tried lots of iterations using quo enquo !! and :=, but I can't make them work.

I’ve also read stack overflow questions and responses here; Use dynamic variable names in `dplyr` and Rename multiple columns by names

But I still can't get it to work.

I did think about trying to come up with a solution using “assign” (I used a loop with “assign” to read all the csvs into the project into the first place) but everytime someone on stack overflow or elsewhere suggested using “assign” someone else comes along and says not to do that, so I haven’t tried very hard to make assign work.

I'm very new to programming, having accidentally stumbled here from an statistics / epidemiology background so any thoughts or advice very much appreciated.

richardb
  • 73
  • 6
  • Is it possible to make all column names the same in all data frames? Might be easier that way. How many columns do you have in each data frame? – NelsonGon Dec 28 '19 at 14:51
  • 1
    Possible, but I'd rather not if I don't have to. I quite like the columns keeping the same name as the questions in ODK form, and it's useful to immediately know where a column came from. There's 30 - 100 columns in each dataframe. But most of them don't need renamed! – richardb Dec 28 '19 at 14:59
  • My answer assumes that you only want to change at columns 001 and 002. Otherwise, you might want to shift values dynamically if there are so many columns for which changes are required. – NelsonGon Dec 28 '19 at 15:26

2 Answers2

3

We could rearrange the columns and perform the manipulations. If the columns to change are more than those in the example, this may be less efficient.:

library(purrr)
   library(dplyr)
    make_ids <- function(df){
    df %>% 
        select(ends_with("001_idscan"),ends_with("002_idmanual"),
               everything()) -> new_data
      keep_values <- names(new_data)[1]

          drop_values <- names(new_data)[2]

          new_data %>% 
           mutate(new_id = ifelse(!is.na(.data[[keep_values]]), 
                                       .data[[keep_values]] ,
                                     .data[[drop_values]]))


          }
        map(mget(ls(pattern = ".*form")),make_ids)

Result:

$A1stform
  a001_idscan a002_idmanual new_id
1         123            NA    123
2         456            NA    456
3          NA           789    789

$B2ndform
  b001_idscan b002_idmanual new_id
1          NA           123    123
2         456            NA    456
3         789            NA    789

$C3rdform
  c001_idscan c002_idmanual new_id
1         123            NA    123
2         456            NA    456
3         789            NA    789

The above result can be compared to the result in the original post:

A1stform %>%
  mutate(a999_id = ifelse(!is.na(a001_idscan),
                           a001_idscan,a002_idmanual))
  a001_idscan a002_idmanual a999_id
1         123            NA     123
2         456            NA     456
3          NA           789     789
NelsonGon
  • 13,015
  • 7
  • 27
  • 57
  • 1
    Thank you for replying. Really pleased to have some help. However, I don't know how to make that output (which looks great on screen) into an amended dataframe. The dataframes (A1stform, B2ndform etc.) still only have two variables (x001_idscan and x002_idmanual) and don't have the "new_id" column in the dataframe. I could read any one of them using ` <-` (maybe...), but how would I amend all the dataframes?? – richardb Dec 29 '19 at 00:11
  • You have to save the output of the last line of code. This line(`map(mget(ls(pattern = ".*form")),make_ids)`). It gives a list of dataframes with the amendments made. For instance if you save as `my_dfs <- map(mget(ls(pattern = ".*form")),make_ids)`, then my_dfs[[1]] should give the amended `A1stform` and so on. The alternative is to use `assign` with `.Globalenv` but that is often not ideal since it populates the environment with so many dataframes. Lists are better. – NelsonGon Dec 29 '19 at 03:52
1

Thank you so much @NelsonGon, really appreciate your help. Sorry for not grasping the list thing initially - thank you.

I hadn't come across map (from purr) before, so that's super helpful going forward.

I've made a tiny tweak to the naming of the new variable using str_sub from stringr and name from baseR to get the new variable to have the ideal name.

And then used list2env (from BaseR) to get the dfs back out of the list in one command (thanks to Converting a list of data frames into individual data frames in R).

I take your point that lists may well be better — but I’ve moved over to R from stata and other stats packages so I’m more comfortable working with dataframes for now.

So my solution (based almost entirely on Nelson's solution) is as follows;

library(purrr)
library(dplyr)

A1stform <- data.frame(a001_idscan = c(123,456,NA), a002_idmanual = c(NA,NA,789))
B2ndform <- data.frame(b001_idscan = c(NA,456,789), b002_idmanual = c(123,NA,NA))
C3rdform <- data.frame(c001_idscan = c(123,456,789), c002_idmanual = c(NA,NA,NA))
dfnames <- c("A1stform","B2ndform","C3rdform")

make_ids <- function(df){
  letter <- substitute(df) %>%
    str_sub(.,1,1) %>%
    tolower
  new_data <- df %>% 
    select(ends_with("001_idscan"),ends_with("002_idmanual"),everything())
  keep_values <- names(new_data)[1]
  drop_values <- names(new_data)[2]
  new_df2 <- new_data %>% 
    mutate(new_var = ifelse(!is.na(.data[[keep_values]]), 
                            .data[[keep_values]] ,
                            .data[[drop_values]]))
  names(new_df2)[names(new_df2) == "new_var"] <- paste0(letter,"999_id")
  new_df2
}

newdfs <- map(mget(dfnames),make_ids)
list2env(newdfs, envir=.GlobalEnv)

richardb
  • 73
  • 6