2

I have a panel data containing NA values. I would like to fill in the NAs by the values of an other data. Let say I want to complete the following panel with new.df.

panel <- data.frame("time" = c(rep(2000,5), rep(2001,5)), 
                    "var1" = rep(1:5, times=2), 
                    "var2" = c(NA,'b','c',NA,'d','a1','b1','c1',NA,'d1'))

new.df <- data.frame("time" = c(2000:2001), 
                     "var1" = c(4,4), 
                     "var2" = c('e','e'))

I tried different combination of merge / aggregate / ddplyr etc.. The issue is that merge or merge.data.frame creates additional columns indexed by .x and .y even tho the colnames are identical.

> merge(panel,new.df,by = c("time","var1"), all=T)
   time var1 var2.x var2.y
1  2000    1   <NA>   <NA>
2  2000    2      b   <NA>
3  2000    3      c   <NA>
4  2000    4   <NA>      e
5  2000    5      d   <NA>
6  2001    1     a1   <NA>
7  2001    2     b1   <NA>
8  2001    3     c1   <NA>
9  2001    4   <NA>      e
10 2001    5     d1   <NA>

I tried also to play with the na.action option without success because the panel will still be incomplete after merging and the remaining NAmust stay as they are. (Depending on the formulation, NA treatment will in some cases replace NA by 0, or by NaN)

I would like to find a way to target the correct indexes in the panel to "insert" new.df$var2 at its right place, knowing that I have a very large panel and it will remain incomplete at the end.

Thanks in advance.

  • Why are you creating those data sets using `cbind` rather a `data.frame`? Is that on purpose? – David Arenburg Jan 18 '17 at 11:27
  • I would suggest you check [this](http://stackoverflow.com/a/41440339/3001626) answer out as it will allow you to update as many columns you like at once. Some thing like `cols <- setdiff(colnames(new.df), c("time", "var1")) ; library(data.table) ; setDT(panel)[setDT(new.df), (cols) := mget(paste0("i.", cols)), on = c("time", "var1")]` – David Arenburg Jan 18 '17 at 11:38

3 Answers3

2

We can use coalesce from tidyr

library(tidyr)
library(dplyr)
full_join(as.data.frame(panel),as.data.frame(new.df),by = c("time","var1")) %>% 
       mutate_each(funs(as.character), var2.x:var2.y) %>%
       mutate(var2= coalesce(var2.x, var2.y)) %>% 
       select(-var2.x, -var2.y)
#   time var1 var2
#1  2000    1 <NA>
#2  2000    2    b
#3  2000    3    c
#4  2000    4    e
#5  2000    5    d
#6  2001    1   a1
#7  2001    2   b1
#8  2001    3   c1
#9  2001    4    e
#10 2001    5   d1

Or we can use a base R option with max.col. Here, 'd1' is the OP's output with merge

 d1$var2 <-d1[,3:4][cbind(1:nrow(d1), max.col(!is.na(d1[3:4]), "first"))]
 d1$var
 #[1] NA   "b"  "c"  "e"  "d"  "a1" "b1" "c1" "e"  "d1"
akrun
  • 874,273
  • 37
  • 540
  • 662
1

Recreate data as data frame

library(dplyr)
panel <- data_frame("time" = c(rep(2000,5), rep(2001,5)), 
                    "var1" = rep(1:5, times=2), 
                    "var2" = c(NA,'b','c',NA,'d','a1','b1','c1',NA,'d1'))

new.df <- data_frame("time" = c(2000:2001), 
                     "var1" = c(4,4), 
                     "var2" = c('e','e'))

Solution 1 Fill NA values with base R merge

panelnew <- merge(panel,new.df,by = c("time","var1"), all=T)
panelnew$var2 <- ifelse(is.na(panelnew$var2.x), panelnew$var2.y, panelnew$var2.x)
panelnew[c("time","var1","var2")]
   time var1 var2
1  2000    1 <NA>
2  2000    2    b
3  2000    3    c
4  2000    4    e
5  2000    5    d
6  2001    1   a1
7  2001    2   b1
8  2001    3   c1
9  2001    4    e
10 2001    5   d1

Solution 2 fill NA values with dplyr left_join and mutate

Here I use dplyr left_join to join the new values. Use full_join if you want to also add combinations of time and var1 that were not present in the original panel. You will get columns var2.x and var2.y columns and that's OK because it reflects the fact that there is an old and a new value. Then mutate to replace NA values by the new value.

result <- panel %>% 
    left_join(new.df, by = c("time", "var1")) %>% 
    mutate(var2 = ifelse(is.na(var2.x),var2.y,var2.x))

Then if you want to keep only the columns of interest

result <- result %>% select(time, var1, var2)

What do you plan to do if a new value replaces an existing value? The code above will keep the old value.

Paul Rougieux
  • 10,289
  • 4
  • 68
  • 110
  • It work but I need to control for the factors inside the ifelse function. Thank you! – La Chouette Jan 18 '17 at 18:10
  • What do you mean by "factors"? There are no factors in Akrun's answer. The method is different, I didn't know about the nice `coalesce` function. But the result data frame is the same as Akrun's answer. I guess you may need to convert your var2 columns to character variables as Akrun has done by adding this line `mutate_each(funs(as.character), var2.x:var2.y) %>%` before the `mutate` line. But if you don't include factors in the first place, by creating character variables only, see `?data_frame` then this is not necessary. – Paul Rougieux Jan 20 '17 at 08:10
1

Or simply (assuming that all values in new.df map to a corresponding NA value in your Panel):

ind <- which(paste0(panel[,1],panel[,2]) %in% paste0(new.df[,1],new.df[,2]))
panel[ind,3] = new.df[,3]
count
  • 1,328
  • 9
  • 16