6

I have two data frames. One (df1) contains all columns and rows of interest, but includes missing observations. The other (df2) includes values to be used in place of missing observations, and only includes columns and rows for which at least one NA was present in df1. I would like to merge the two data sets somehow to obtain the desired.result.

This seems like a very simple problem to solve, but I am drawing a blank. I cannot get merge to work. Maybe I could write nested for-loops, but have not done so yet. I also tried aggregate a few time. I am a little afraid to post this question, fearing my R card might be revoked. Sorry if this is a duplicate. I did search here and with Google fairly intensively. Thank you for any advice. A solution in base R is preferable.

df1 = read.table(text = "
  county year1 year2 year3
    aa     10    20   30
    bb      1    NA    3
    cc      5    10   NA
    dd    100    NA  200
", sep = "", header = TRUE)

df2 = read.table(text = "
  county year2 year3
    bb      2   NA
    cc     NA   15
    dd    150   NA
", sep = "", header = TRUE)

desired.result = read.table(text = "
  county year1 year2 year3
    aa     10    20   30
    bb      1     2    3
    cc      5    10   15
    dd    100   150  200
", sep = "", header = TRUE)
Mark Miller
  • 12,483
  • 23
  • 78
  • 132

3 Answers3

9

aggregate can do this:

aggregate(. ~ county,
          data=merge(df1, df2, all=TRUE), # Merged data, including NAs
          na.action=na.pass,              # Aggregate rows with missing values...
          FUN=sum, na.rm=TRUE)            # ...but instruct "sum" to ignore them.
##   county year2 year3 year1
## 1     aa    20    30    10
## 2     bb     2     3     1
## 3     cc    10    15     5
## 4     dd   150   200   100
thelatemail
  • 91,185
  • 12
  • 128
  • 188
Matthew Lundberg
  • 42,009
  • 6
  • 90
  • 112
  • 1
    `FUN=Filter, f = Negate(is.na)` would be another option for the function (would keep duplicates, which shouldn't happen anyway if OP specification are correct) – mnel Apr 05 '13 at 01:23
  • Excellent - a prime example of how base R has a lot of really neat and easily interpretable functions which are often overlooked. – thelatemail Apr 05 '13 at 01:44
  • Thank you. Great answer. Although, I think there needs to be a comma after na.pass. I tried to edit the post and add the comma, but I guess edits have to be longer than one character. – Mark Miller Apr 05 '13 at 01:46
  • @MarkMiller next time, edit the commented text too, to say what you're doing. `# (added comma) Aggregate rows...`. Gets you over the character limit, and decreases the "reject" votes. – Matthew Lundberg Apr 05 '13 at 02:23
2

This will do:

m <- merge(df1, df2, by="county", all=TRUE)

dotx <- m[,grepl("\\.x",names(m))]

doty <- m[,grepl("\\.y",names(m))]

dotx[is.na(dotx)] <- doty[is.na(dotx)]

names(dotx) <- sapply(strsplit(names(dotx),"\\."), `[`, 1)

result <- cbind(m[,!grepl("\\.x",names(m)) & !grepl("\\.y",names(m))], dotx)

Checking:

> result
  county year1 year2 year3
1     aa    10    20    30
2     bb     1     2     3
3     cc     5    10    15
4     dd   100   150   200
Ferdinand.kraft
  • 12,579
  • 10
  • 47
  • 69
2

Another option unsing reshape2 and working in the long format :

library(reshape2)
## reshape to long format
df1.m <- melt(df1)
df2.m <- melt(df2)
## get common values
idx <- df1.m$county %in% df2.m$county & 
       df1.m$variable%in% df2.m$variable
## replace NA values 
df1.m[idx,]$value <- ifelse(is.na(df1.m[idx,]$value),
                            df2.m$value , 
                            df1.m[idx,]$value)
## get the wide format
dcast(data=df1.m,county~variable)

  county year1 year2 year3
1     aa    10    20    30
2     bb     1     2     3
3     cc     5    10    15
4     dd   100   150   200
agstudy
  • 119,832
  • 17
  • 199
  • 261
  • Your answer is quite general. For example, it still works if I change 10 to NA for county aa in year1 and delete year3 from df2. – Mark Miller Apr 12 '13 at 19:26
  • @MarkMiller yes because it profit from the long format which uses just county as id (a key), other columns are just variables. – agstudy Apr 12 '13 at 20:22