library(data.table);
## generate data
set.seed(5L);
NK <- 6L; pA <- 0.8; pB <- 0.2;
keydf <- unique(data.frame(Age=sample(18:65,NK,T),City=sample(c('Chicago','NewYork'),NK,T),Gender=sample(c('M','F'),NK,T),stringsAsFactors=F));
NO <- nrow(keydf)-1L;
Af <- cbind(keydf[-1L,],Name=sample(paste0(LETTERS,LETTERS,LETTERS),NO,T),Income=sample(c(NA,paste0(seq(20L,90L,10L),'K')),NO,T,c(pA,rep((1-pA)/8,8L))),stringsAsFactors=F)[sample(seq_len(NO)),];
Bf <- cbind(keydf[-2L,],`Avg Income`=sample(c(NA,paste0(seq(20L,90L,10L),'K')),NO,T,c(pB,rep((1-pB)/8,8L))),stringsAsFactors=F)[sample(seq_len(NO)),];
At <- as.data.table(Af);
Bt <- as.data.table(Bf);
At;
## Age City Gender Name Income
## 1: 50 NewYork F OOO NA
## 2: 23 Chicago M SSS NA
## 3: 62 NewYork M VVV NA
## 4: 51 Chicago F FFF 90K
## 5: 31 Chicago M XXX NA
Bt;
## Age City Gender Avg Income
## 1: 62 NewYork M NA
## 2: 51 Chicago F 60K
## 3: 31 Chicago M 50K
## 4: 27 NewYork M NA
## 5: 23 Chicago M 60K
I generated some random test data for demonstration purposes. I'm quite happy with the result I got with seed 5, which covers many cases:
- one row in A that doesn't join with B (50/NewYork/F).
- one row in B that doesn't join with A (27/NewYork/M).
- two rows that join and should result in a replacement of NA in A with a non-NA value from B (23/Chicago/M and 31/Chicago/M).
- one row that joins but has NA in B, so shouldn't affect the NA in A (62/NewYork/M).
- one row that could join, but has non-NA in A, so shouldn't take the value from B (I assumed you would want this behavior) (51/Chicago/F). The value in A (90K) differs from the value in B (60K), so we can verify this behavior.
And I intentionally scrambled the rows of A and B to ensure we join them correctly, regardless of incoming row order.
## data.table solution
keys <- c('Age','City','Gender');
At[is.na(Income),Income:=Bt[.SD,on=keys,`Avg Income`]];
## Age City Gender Name Income
## 1: 50 NewYork F OOO NA
## 2: 23 Chicago M SSS 60K
## 3: 62 NewYork M VVV NA
## 4: 51 Chicago F FFF 90K
## 5: 31 Chicago M XXX 50K
In the above I filter for NA values in A first, then do a join in the j
argument on the key columns and assign in-place the source column to the target column using the data.table :=
syntax.
Note that in the data.table world X[Y]
does a right join, so if you want a left join you need to reverse it to Y[X]
(with "left" now referring to X
, counter-intuitively). That's why I used Bt[.SD]
instead of (the likely more natural expectation of) .SD[Bt]
. We need a left join on .SD
because the result of the join index expression will be assigned in-place to the target column, and so the RHS of the assignment must be a full vector correspondent to the target column.
You can repeat the in-place assignment line for each column you want to replace.
## base R solution
keys <- c('Age','City','Gender');
m <- merge(cbind(Af[keys],Ai=seq_len(nrow(Af))),cbind(Bf[keys],Bi=seq_len(nrow(Bf))))[c('Ai','Bi')];
m;
## Ai Bi
## 1 2 5
## 2 5 3
## 3 4 2
## 4 3 1
mi <- which(is.na(Af$Income[m$Ai])); Af$Income[m$Ai[mi]] <- Bf$`Avg Income`[m$Bi[mi]];
Af;
## Age City Gender Name Income
## 2 50 NewYork F OOO <NA>
## 5 23 Chicago M SSS 60K
## 3 62 NewYork M VVV <NA>
## 6 51 Chicago F FFF 90K
## 4 31 Chicago M XXX 50K
I guess I was feeling a little bit creative here, so for a base R solution I did something that's probably a little unusual, and which I've never done before. I column-bound a synthesized row index column into the key-column subset of each of the A and B data.frames, then called merge()
to join them (note that this is an inner join, since we don't need any kind of outer join here), and extracted just the row index columns that resulted from the join. This effectively precomputes the joined pairs of rows for all subsequent modification operations.
For the modification, I precompute the subset of the join pairs for which the row in A satisfies the replacement condition, e.g. that its Income
value is NA for the Income
replacement. We can then subset the join pair table for those rows, and do a direct assignment from B to A to carry out the replacement.
As before, you can repeat the assignment line for every column you want to replace.