
I'm trying to clean up a data frame and would like to replace NAs in one column with the corresponding value from another column. I would also like to do this for multiple columns at once.

Example data frame:

set.seed(123) 

dates <- seq(as.Date("2016-01-01"), by = "day", length = 10)  
names <- rep(c("John Doe", "Jane Smith"), each = 5)  
var1_group <- runif(10)  
var2_group <- runif(10)  
var1_person <- runif(10)  
var2_person <- runif(10)  

myDF <- data.frame(names, var1_group, var2_group, var1_person, var2_person)  
myDF <- cbind(dates, myDF)  

After some manipulations using dplyr...

myDF <- myDF %>% mutate_each(funs(lag), contains("group"))  
myDF <- myDF %>% group_by(names) %>% mutate_each(funs(lag), contains("person"))  

I get a bunch of NAs...

        dates      names var1_group var2_group var1_person var2_person  
1  2016-01-01   John Doe         NA         NA          NA          NA  
2  2016-01-02   John Doe  0.2875775 0.95683335   0.8895393   0.9630242  
3  2016-01-03   John Doe  0.7883051 0.45333416   0.6928034   0.9022990  
4  2016-01-04   John Doe  0.4089769 0.67757064   0.6405068   0.6907053  
5  2016-01-05   John Doe  0.8830174 0.57263340   0.9942698   0.7954674  
6  2016-01-06 Jane Smith  0.9404673 0.10292468          NA          NA  
7  2016-01-07 Jane Smith  0.0455565 0.89982497   0.7085305   0.4777960  
8  2016-01-08 Jane Smith  0.5281055 0.24608773   0.5440660   0.7584595  
9  2016-01-09 Jane Smith  0.8924190 0.04205953   0.5941420   0.2164079  
10 2016-01-10 Jane Smith  0.5514350 0.32792072   0.2891597   0.3181810  

What I would like to do now is replace the NAs in the *_person columns with the corresponding values from the *_group columns (see row 6):

        dates      names var1_group var2_group var1_person var2_person  
1  2016-01-01   John Doe         NA         NA          NA          NA  
2  2016-01-02   John Doe  0.2875775 0.95683335   0.8895393   0.9630242  
3  2016-01-03   John Doe  0.7883051 0.45333416   0.6928034   0.9022990  
4  2016-01-04   John Doe  0.4089769 0.67757064   0.6405068   0.6907053  
5  2016-01-05   John Doe  0.8830174 0.57263340   0.9942698   0.7954674  
6  2016-01-06 Jane Smith  0.9404673 0.10292468   0.9404673   0.1029247
7  2016-01-07 Jane Smith  0.0455565 0.89982497   0.7085305   0.4777960  
8  2016-01-08 Jane Smith  0.5281055 0.24608773   0.5440660   0.7584595  
9  2016-01-09 Jane Smith  0.8924190 0.04205953   0.5941420   0.2164079  
10 2016-01-10 Jane Smith  0.5514350 0.32792072   0.2891597   0.3181810 

This works for one column...

myDF$var1_person <- ifelse(is.na(myDF$var1_person), myDF$var1_group, myDF$var1_person)  

But I'd like to do it for all the columns at once. In my actual data frame, each group has around 20 columns. I've tried a bunch of other stuff, but I don't want to clutter this post with my nonsense.

*Bonus points if you can get the code to match for n variables based on column prefix:

var1_group > var1_person  
var2_group > var2_person
...
varn_group > varn_person   
– JPete
  • Your problem arises because your data is not tidy - you are encoding some index value in the column names. Melt your data into long format (1 person column, 1 group column, 1 index column), then make your replacement, then go back to wide format. See `reshape2` or `tidyr` for converting between long and wide formats. – Gregor Thomas Sep 13 '16 at 23:25
  • Also, since you're already using `dplyr`, `coalesce(var1_person, var1_group)` is equivalent to your `ifelse()` statement. – Gregor Thomas Sep 13 '16 at 23:26
  • How efficient would this approach be? For a very large data frame, say 1 mil rows and 100 columns, would this still be viable? I was thinking that I could also try to `apply` something like `coalesce(col_list_person, col_list_group)` – JPete Sep 14 '16 at 00:14
  • The estimating isn't difficult: 1MM rows and 100 columns suggests about 50 columns for each index, so that would be about 50MM rows in long format (with only ~3 columns, plus whatever date/name ID columns you have). Depends on your machine. I'd do it because I'd probably never go back to wide format - the long format tends to be more general to work with. If this is the only operation you want to apply to each of your column pairs, probably not worth it. But if you want to do lots of stuff to each column pair then it's super easy in long format with `data.table` or `dplyr`. – Gregor Thomas Sep 14 '16 at 00:26
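
As a follow-up to the `coalesce()` suggestion, here is a rough sketch of how it could be applied to every matching column pair while staying in wide format. It assumes every *_person column has a *_group counterpart and is only illustrative, not tested beyond the example data:

library(dplyr)

person_cols <- grep("_person$", names(myDF), value = TRUE)
group_cols  <- sub("_person$", "_group", person_cols)
# replace NAs in each *_person column with values from its *_group counterpart
myDF[person_cols] <- Map(coalesce, myDF[person_cols], myDF[group_cols])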

3 Answers


Here's a "tidyverse" approach. Note that as @Gregor commented, it helps to tidy your data. The following handles this for you and also returns a somewhat tidy data frame. I'll leave it to you to get back into the original format if necessary.

Note that I've used the mutate_cond() function that can be found here.
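
For reference, that helper is roughly the following (adapted from the linked answer; treat it as a sketch rather than an official dplyr function):

mutate_cond <- function(.data, condition, ..., envir = parent.frame()) {
  # evaluate the condition in the context of the data, then mutate only those rows
  condition <- eval(substitute(condition), .data, envir)
  .data[condition, ] <- .data[condition, ] %>% mutate(...)
  .data
}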

library(tidyverse)
library(stringr)

myDF %>%
  gather(key = col, value = val, -dates, -names) %>% 
  mutate(col = str_replace(col, "var", "")) %>% 
  separate(col, into = c("var", "group")) %>%
  spread(key = group, value = val) %>% 
  mutate_cond(is.na(person), person = group)

#> Source: local data frame [20 x 5]
#> Groups: names [2]
#> 
#>         dates      names   var      group    person
#> *      <date>     <fctr> <chr>      <dbl>     <dbl>
#> 1  2016-01-01   John Doe     1         NA        NA
#> 2  2016-01-01   John Doe     2         NA        NA
#> 3  2016-01-02   John Doe     1 0.28757752 0.8895393
#> 4  2016-01-02   John Doe     2 0.95683335 0.9630242
#> 5  2016-01-03   John Doe     1 0.78830514 0.6928034
#> 6  2016-01-03   John Doe     2 0.45333416 0.9022990
#> 7  2016-01-04   John Doe     1 0.40897692 0.6405068
#> 8  2016-01-04   John Doe     2 0.67757064 0.6907053
#> 9  2016-01-05   John Doe     1 0.88301740 0.9942698
#> 10 2016-01-05   John Doe     2 0.57263340 0.7954674
#> 11 2016-01-06 Jane Smith     1 0.94046728 0.9404673
#> 12 2016-01-06 Jane Smith     2 0.10292468 0.1029247
#> 13 2016-01-07 Jane Smith     1 0.04555650 0.7085305
#> 14 2016-01-07 Jane Smith     2 0.89982497 0.4777960
#> 15 2016-01-08 Jane Smith     1 0.52810549 0.5440660
#> 16 2016-01-08 Jane Smith     2 0.24608773 0.7584595
#> 17 2016-01-09 Jane Smith     1 0.89241904 0.5941420
#> 18 2016-01-09 Jane Smith     2 0.04205953 0.2164079
#> 19 2016-01-10 Jane Smith     1 0.55143501 0.2891597
#> 20 2016-01-10 Jane Smith     2 0.32792072 0.3181810

Everything but the last line is about tidying the data. The last line (mutate_cond()) handles the substitution of NA values. If your columns are all named this way, then this should extend to any n.
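
If you do need to get back to the original wide layout, something along these lines should work (a rough sketch using the same tidyr verbs; `tidyDF` is a placeholder for the result above, and column order may differ):

# `tidyDF` is assumed to hold the tidy result from the pipeline above
tidyDF %>%
  gather(key = type, value = val, group, person) %>%
  unite(col, var, type, sep = "_") %>%
  mutate(col = paste0("var", col)) %>%
  spread(key = col, value = val)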

– Simon Jackson

Here's an idea:

library(dplyr)

tag <- c("person", "group")

# Create a list of 2 elements, persons and groups.
lst <- lapply(tag, function(x) { myDF[grepl(x, colnames(myDF))] })

# Extract everything before the underscore "_" in the column names
ext <- lapply(seq_along(lst), function(x) { 
  stringi::stri_extract(colnames(lst[[x]]), regex = "^[^_]+(?=_)") })

# Find the common elements between the two
int <- intersect(ext[[1]], ext[[2]]) 

# Create a new list with only the matching subset 
match_list <- lapply(lst, function(x) { select(x, matches(paste(int, collapse = "|"))) })

# Replace all NA values in 'person' by the corresponding values in 'group'
res <- mapply(function(x, y) { replace(x, is.na(x), y[is.na(x)]) }, 
              match_list[[1]], match_list[[2]])

# Assign the result back to the original data.frame
myDF[, colnames(res)] <- res

This should ignore non-matching person/group pairs and only perform the replacement on matching variables.

– Steven Beaupré

Here is one option using `set` from `data.table`, which does the replacement in place:

library(data.table)
#convert the data.frame to data.table
setDT(myDF)
#get the column name of 'group' and 'person' columns
nm1 <-  grep("group", names(myDF), value = TRUE)
nm2 <-  grep("person", names(myDF), value = TRUE)
#loop through the sequence of 'nm1'
for(j in seq_along(nm1)){
#set the elements in the rows that are NA for each 'person' column
#with the corresponding rows from the 'group' column passed as the "value" argument
    set(myDF, i = which(is.na(myDF[[nm2[j]]])), j = nm2[j],
        value = myDF[[nm1[j]]][is.na(myDF[[nm2[j]]])])
}

 myDF
 #        dates      names var1_group var2_group var1_person var2_person
 #1: 2016-01-01   John Doe         NA         NA          NA          NA
 #2: 2016-01-02   John Doe  0.2875775 0.95683335   0.8895393   0.9630242
 #3: 2016-01-03   John Doe  0.7883051 0.45333416   0.6928034   0.9022990
 #4: 2016-01-04   John Doe  0.4089769 0.67757064   0.6405068   0.6907053
 #5: 2016-01-05   John Doe  0.8830174 0.57263340   0.9942698   0.7954674
 #6: 2016-01-06 Jane Smith  0.9404673 0.10292468   0.9404673   0.1029247
 #7: 2016-01-07 Jane Smith  0.0455565 0.89982497   0.7085305   0.4777960
 #8: 2016-01-08 Jane Smith  0.5281055 0.24608773   0.5440660   0.7584595
 #9: 2016-01-09 Jane Smith  0.8924190 0.04205953   0.5941420   0.2164079
 #10: 2016-01-10 Jane Smith  0.5514350 0.32792072   0.2891597   0.3181810
– akrun
  • I think this would not handle cases where there are non-matching pairs of person/group. For example, should there also be `var3_person <- runif(10); var5_group <- runif(10)` in `myDF`, this would replace `var3_person`'s NAs with `var5_group`'s values. – Steven Beaupré Sep 14 '16 at 03:41
  • @StevenBeaupré I was following the OP's example. – akrun Sep 14 '16 at 04:46
  • Ok. Just wanted to mention it in case OP's real data wasn't perfectly symmetric. – Steven Beaupré Sep 14 '16 at 10:34
  • This is what worked for me with the least tweaking for my specific code. I appreciated the replacement in place rather than producing a whole new data frame. Thanks! – JPete Sep 15 '16 at 17:38
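
If the real data isn't perfectly symmetric (as discussed in the comments above), one possible guard for this approach is to derive `nm2` from `nm1` and keep only the pairs that actually exist. A rough sketch:

nm1 <- grep("_group$", names(myDF), value = TRUE)
nm2 <- sub("_group$", "_person", nm1)
#keep only pairs where the matching 'person' column actually exists
keep <- nm2 %in% names(myDF)
nm1 <- nm1[keep]
nm2 <- nm2[keep]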