2

I’m fairly new to R. I’ve been searching for multiple hours and asking colleagues who have more experience with R for help , but still haven’t found a solution. I did find a related thread here that provided some inspiration, but I couldn't figure out how to extend the approach to solve my issue.

Overview

I have a dataset with 1800+ variables. The issue is that the dataset contains two sets of 140 variables with different data that were created from when the same survey went out at two different timepoints by accident based upon an automation error. The database created two versions of the same variable. I’m trying to index a unique column to check if the value for that row is “NA”, if it is then I want to fill it with values from the second set of 140 columns and overwrite the values in the original 140 columns. If the row isn’t “NA”, then I want it to leave it as is.

Dummy Data Set

Based on the below dummy set, what I’m attempting to do is check “week0date_v2” to see if it’s a NA value for a given row. If yes, then grab the values in that row from the columns from “week0date” to “var3” and write these values to same row in the columns from “week0date_v2” to “var3_v2”. If the value is not NA for a given row, then leave the original data.

library('tidyverse')
dummy_data<- data.frame(
  id = c(1,2,3,4,5,6), 
  week0date_v2 = c("2022/2/1", NA, "2022/2/15", "2022/2/18", "2022/2/15", NA),
  var1_v2 = c(75,NA,66,NA,65,NA), 
  var2_v2 = c(19,NA,67,10,54,NA), 
  var3_v2 = c(NA,NA,99,125,73,NA),
  verify_v2 = c(1,NA,1,1,1,NA),
  week0date = c(NA, "2022/3/2", "2022/4/15", "2022/4/18", NA, "2022/4/15"),
  var1 = c(NA,52,NA,63,NA,75),
  var2 = c(NA,78,NA,NA,NA,63),
  var3 = c(15,64,25,NA,20,13),
  verify = c(NA,1,NA,1,NA,1)
)
dummy_data <- dummy_data %>% mutate_if(is.character, as.Date)

  id week0date_v2 var1_v2 var2_v2 var3_v2 verify_v2  week0date var1 var2 var3 verify
1  1   2022-02-01      75      19      NA         1       <NA>   NA   NA   15     NA
2  2         <NA>      NA      NA      NA        NA 2022-03-02   52   78   64      1
3  3   2022-02-15      66      67      99         1 2022-04-15   NA   NA   25     NA
4  4   2022-02-18      NA      10     125         1 2022-04-18   63   NA   NA      1
5  5   2022-02-15      65      54      73         1       <NA>   NA   NA   20     NA
6  6         <NA>      NA      NA      NA        NA 2022-04-15   75   63   13      1

Current Thinking

I’ve tried multiple approaches and haven’t been successful.

My colleague and I first tried to get it to work for a single variable (just the “week0date_v2” column), which we were able to do using mutate() and case_when() approach below:

dummy_data_first_attempt <- dummy_data %>% 
  mutate(
    week0date_v2 = case_when(
      !is.na(week0date_v2) ~ week0date_v2, 
      is.na(week0date_v2) ~ week0date
    )) 

I don’t know how to extend the above approach to apply across the span of columns “week0date_v2” to “var3_v2." I tried writing a function to do this, but my skills with writing a function weren’t sufficient.

Any help with how to extend what the above to achieve the below desired outcome would be so greatly appreciated !!!

Desired Outcome

This is what I'm hoping to end up with (i.e., the values from “week0date” to “var3” are copied and pasted into “week0date_v2” to “var3_v2” when a given row in "week0date_v2" is NA ... else leave the original data)

dummy_data_wrangled <- data.frame(
  id = c(1,2,3,4,5,6), 
  week0date_v2 = c("2022/2/1", "2022/3/2", "2022/2/15", "2022/2/18", "2022/2/15", "2022/4/15"),
  var1_v2 = c(75,52,66,NA,65,75), 
  var2_v2 = c(19,78,67,10,54,63), 
  var3_v2 = c(NA,64,99,125,73,13),
  verify_v2 = c(1,NA,1,1,1,NA),
  week0date = c(NA, "2022/3/2", "2022/4/15", "2022/4/18", NA, "2022/4/15"),
  var1 = c(NA,52,NA,63,NA,75),
  var2 = c(NA,78,NA,NA,NA,63),
  var3 = c(15,64,25,NA,20,13),
  verify = c(NA,1,NA,1,NA,1)
)
dummy_data_wrangled
  id week0date_v2 var1_v2 var2_v2 var3_v2 verify_v2 week0date var1 var2 var3 verify
1  1     2022/2/1      75      19      NA         1      <NA>   NA   NA   15     NA
2  2     2022/3/2      52      78      64        NA  2022/3/2   52   78   64      1
3  3    2022/2/15      66      67      99         1 2022/4/15   NA   NA   25     NA
4  4    2022/2/18      NA      10     125         1 2022/4/18   63   NA   NA      1
5  5    2022/2/15      65      54      73         1      <NA>   NA   NA   20     NA
6  6    2022/4/15      75      63      13        NA 2022/4/15   75   63   13      1
lander45
  • 23
  • 3

3 Answers3

0

Add dplyr way below.

I tried both dplyr and base R.

In dplyr way,

dummy_data %>%
  mutate(across(ends_with("_v2") & !starts_with("verify"), ~ifelse(
    !is.na(week0date_v2),
    .x,
    get(str_remove(cur_column(), "_v2"))
  )))

This below is pretty messy for loop way.

for (j in 1:nrow(dummy_data)){
  if (is.na(dummy_data$week0date_v2[j])) {
    for (i in names(dummy_data)[endsWith(names(dummy_data), "_v2") & !startsWith(names(dummy_data), "verify")]) {
      dummy_data[j, i] <- dummy_data[j, str_remove(i, "_v2")]
    }
  }
}

  id week0date_v2 var1_v2 var2_v2 var3_v2 verify_v2 week0date var1 var2 var3 verify
1  1     2022/2/1      75      19      NA         1      <NA>   NA   NA   15     NA
2  2     2022/3/2      52      78      64        NA  2022/3/2   52   78   64      1
3  3    2022/2/15      66      67      99         1 2022/4/15   NA   NA   25     NA
4  4    2022/2/18      NA      10     125         1 2022/4/18   63   NA   NA      1
5  5    2022/2/15      65      54      73         1      <NA>   NA   NA   20     NA
6  6    2022/4/15      75      63      13        NA 2022/4/15   75   63   13      1
Park
  • 14,771
  • 6
  • 10
  • 29
  • Thx for reply. Very helpful. I tried with my original 1800 var dataset, but the dplyr way didn't work :( My issue is that my dummy dataset above is a bit too simple because I can't use str_remove(cur_column() on original data due to variable names that are non-uniform. Is there a way to use select() vs.endsWith() & !startsWith() and str_remove() and cur_column(). For example, something like: select(dummy_data, week0date:var3) ... and ... select(dummy_data, week0date_v2:var3_v2)) – lander45 Jun 08 '22 at 18:08
  • @lander45 Well, your example's column names are pretty simple in this case, so you may find out the way to select those one by yourself, that there are too many columns.... – Park Jun 08 '22 at 23:58
0

Here's a vectorised base R approach -

replace_inds <- is.na(dummy_data$week0date_v2)
cols <- c('week0date', grep('^var\\d+$', names(dummy_data), value = TRUE))
cols_to_replace <- paste0(cols, '_v2')
dummy_data[replace_inds,cols_to_replace] <- dummy_data[replace_inds, cols]
dummy_data

#  id week0date_v2 var1_v2 var2_v2 var3_v2 verify_v2  week0date var1 var2 var3 verify
#1  1   2022-02-01      75      19      NA         1       <NA>   NA   NA   15     NA
#2  2   2022-03-02      52      78      64        NA 2022-03-02   52   78   64      1
#3  3   2022-02-15      66      67      99         1 2022-04-15   NA   NA   25     NA
#4  4   2022-02-18      NA      10     125         1 2022-04-18   63   NA   NA      1
#5  5   2022-02-15      65      54      73         1       <NA>   NA   NA   20     NA
#6  6   2022-04-15      75      63      13        NA 2022-04-15   75   63   13      1

Note that this answer is very much based on the column names of the data that you have so make sure you adjust it according to your dataset.

Here is some explanation of the code -

replace_inds returns TRUE for NA values in week0date_v2. The TRUE values are the rows that we want to replace.

replace_inds
#[1] FALSE  TRUE FALSE FALSE FALSE  TRUE

cols are the name of the columns which we want to copy the values from in case of NA in week0date_v2.

cols
#[1] "week0date" "var1"      "var2"      "var3"     

cols_to_replace are the columns which are to be replaced by corresponding values in cols.

cols_to_replace
#[1] "week0date_v2" "var1_v2"      "var2_v2"      "var3_v2"  
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • Very helpful. Thx! I tried with my original 1800 var dataset, but didn't work :( My issue is that my dummy dataset above is a bit too simple because I can't use grep() w/ regex on original data due to variable names that are non-uniform. I tried this w/ select() and it works on dummy dataset but not original. I don't know why not. Ideas? `replace_inds <- is.na(dummy_data$week0date_v2) cols <-colnames(select(dummy_data, week0date:var3)) cols_to_replace <- colnames(select(dummy_data, week0date_v2:var3_v2)) dummy_data[replace_inds,cols_to_replace] <- dummy_data[replace_inds, cols] dummy_data` – lander45 Jun 08 '22 at 03:23
  • After cleaning up my code a bit and clearing objects from my R studio workspace and rerunning the code above that I modified based on what you provided. It now works! Thank you! – lander45 Jun 09 '22 at 14:10
0

Here's an approach that I think might be more "vectorized" than Ronak Shah's. It uses two-column matrix indexing on both sides of congruent sections of the dataframe. Admittedly it does include the verify columns but I don't think that really detracts from this demonstration:

dummy_data[2:6][ which( is.na(dummy_data[2:6]), arr.ind=TRUE)] <-   dummy_data[7:11][which( is.na(dummy_data[2:6]), arr.ind = TRUE) ]
> dummy_data
  id week0date_v2 var1_v2 var2_v2 var3_v2 verify_v2 week0date var1 var2 var3 verify
1  1     2022/2/1      75      19      15         1      <NA>   NA   NA   15     NA
2  2     2022/3/2      52      78      64         1  2022/3/2   52   78   64      1
3  3    2022/2/15      66      67      99         1 2022/4/15   NA   NA   25     NA
4  4    2022/2/18      63      10     125         1 2022/4/18   63   NA   NA      1
5  5    2022/2/15      65      54      73         1      <NA>   NA   NA   20     NA
6  6    2022/4/15      75      63      13         1 2022/4/15   75   63   13      1

I did a separate run without changing the "verify" column. I defined the selection matrix separately:

> tofind <- which( is.na(dummy_data[grepl("(week|var).+v2", names(dummy_data) )]), arr.ind=TRUE)
> cols_to_repl <- grepl("(week|var).+v2", names(dummy_data) )
> dummy_data[cols_to_repl][ tofind ] <-   dummy_data[7:11][ tofind  ]
> dummy_data
  id week0date_v2 var1_v2 var2_v2 var3_v2 verify_v2 week0date var1 var2 var3 verify
1  1     2022/2/1      75      19      15         1      <NA>   NA   NA   15     NA
2  2     2022/3/2      52      78      64        NA  2022/3/2   52   78   64      1
3  3    2022/2/15      66      67      99         1 2022/4/15   NA   NA   25     NA
4  4    2022/2/18      63      10     125         1 2022/4/18   63   NA   NA      1
5  5    2022/2/15      65      54      73         1      <NA>   NA   NA   20     NA
6  6    2022/4/15      75      63      13        NA 2022/4/15   75   63   13      1
IRTFM
  • 258,963
  • 21
  • 364
  • 487