1

How do I subset missing values in one dataframe with values from another?

Let's say I have two datasets:

dataset 1 shows the amount of food that is produced by a country each day.

 country         day     tonnes of food
## 1       china  1          6
## 2       china  1          NA
## 3       china  2          2
## 4       china  2          NA

dataset2 is the average amount of food by day

country         day     average tonnes of food
## 1       china  1          6
## 3       china  2          2

How can I fill in the NAs of dataset1 with the averages from dataset2.

I.e. IF is.na(dataset1$tonnes) is TRUE then fill in with average for day from dataset2$averagetonnes

Brian Tompsett - 汤莱恩
  • 5,753
  • 72
  • 57
  • 129
user3200293
  • 181
  • 5
  • 18
  • Just `merge(df1, df2)[-3]` – Rich Scriven Jan 09 '16 at 18:23
  • what does the `-3` do? – user3200293 Jan 09 '16 at 18:39
  • Removes the third column from the merge (the old third column in the original data). Have a look at `merge(df1, df2)`. I think `setNames(merge(df1, df2)[-3], names(df1))` is the closest to what you want. – Rich Scriven Jan 09 '16 at 18:40
  • would that not simply replace all of the values with the averages, as opposed to only the NAs? – user3200293 Jan 09 '16 at 18:47
  • 1
    No. Why would it do that? There are no averages being calculated there. – Rich Scriven Jan 09 '16 at 18:54
  • Unless I am using merge incorrectly, it binds the columns of the dataset rather than fill in the NAs. I.E. I am left with 3 columns. `"day","tonnes.x","tonnes.y"` Rather than two columns `"day" "tonnes"` `tonnes` having the NAs replaced with values from dataframe2. – user3200293 Jan 09 '16 at 19:02

2 Answers2

4

If I understand you correctly using the match function will solve your problem. Data:

df1 <- data.frame(country=c(rep('china1',2),rep('china2',2)),day=c(1,1,2,2),tof = c(6,NA,2,NA),stringsAsFactors = F)
df2 <- data.frame(country=c('china1','china2'),day=c(1,2),atof = c(6,2),stringsAsFactors = F)
df1
  country day tof
#1  china1   1   6
#2  china1   1  NA
#3  china2   2   2
#4  china2   2  NA

This line will replace the NAs with the averages of the corresponding country of the second data.frame df2. The match function results in a vector of positions of matches and [which(is.na(df1$tof))] selects the indices where there is a NA in the “tof” column.

df1$tof[is.na(df1$tof)] <- df2$atof[match(df1$country,df2$country)][which(is.na(df1$tof))]
df1
  country day tof
#1  china1   1   6
#2  china1   1   6
#3  china2   2   2
#4  china2   2   2
hvollmeier
  • 2,956
  • 1
  • 12
  • 17
2

We can use join in data.table

library(data.table)
setDT(df1)[df2, on =c("country", "day")][is.na(tonnes.of.food), 
  tonnes.of.food:= average.tonnes.of.food][, average.tonnes.of.food:=NULL][]
#   country day tonnes.of.food
#1:   china   1              6
#2:   china   1              6
#3:   china   2              2
#4:   china   2              2
akrun
  • 874,273
  • 37
  • 540
  • 662
  • Do you know of an answer using base, reshape or plyr? I'm not acquainted with data.table. – user3200293 Jan 09 '16 at 18:57
  • @user3200293 It is already posted in the comments by RichardScriven – akrun Jan 09 '16 at 18:59
  • That does not give the desired results. Unless I am using merge incorrectly, instead of filling in the NA's, it simply binds the columns of each dataset together. – user3200293 Jan 09 '16 at 19:01