R: Updating NAs in a data table with values of another data table

Question

There are two data tables of the following structure:

 DT1 <- data.table(ID=c("A","B","C"), P0=c(1,10,100), key="ID")
 DT2 <- data.table(ID=c("B","B","B","A","A","A","C","C","C"), t=rep(seq(0:2),3), P=c(NA,30,50,NA,4,6,NA,200,700))

In data tableDT2all NAs in column P shall be updated by values P0 out of data table DT1.

If DT2 is ordered by ID like DT1, the problem can be solved like this:

 setorder(DT2,ID)
 idxr <- which(DT2[["t"]]==1)
 set(DT2, i=idxr, j="P", value=DT1[["P0"]])

But how can the data tables be "merged" without ordering DT2 before?

score 5 · Answer 1 · answered Nov 29 '15 at 10:53

5

Here's another option of joining by condition

DT2[is.na(P), P := DT1[.SD, P0]]
DT2
#    ID t   P
# 1:  B 1  10
# 2:  B 2  30
# 3:  B 3  50
# 4:  A 1   1
# 5:  A 2   4
# 6:  A 3   6
# 7:  C 1 100
# 8:  C 2 200
# 9:  C 3 700

answered Nov 29 '15 at 10:53

David Arenburg

91,361
17
137
196

akrun · Accepted Answer · 2015-11-29T10:59:23.843

3

We can join the two datasets on 'ID', for NA values in 'P', we assign 'P' as 'P0', and then remove the 'P0' by assigning it to 'NULL'.

library(data.table)#v1.9.6+
DT2[DT1, on='ID'][is.na(P), P:= P0][, P0:= NULL][]

Or as @DavidArenburg mentioned, we can use ifelse condition after joining on 'ID' to replace the NA elements in 'P'.

DT2[DT1, P := ifelse(is.na(P), i.P0, P), on = 'ID']

edited Nov 29 '15 at 10:59

answered Nov 29 '15 at 10:39

akrun

874,273
37
540
662

I understand, what is done in each `[]`. Is this a kind of one after another operation? – kamath Nov 29 '15 at 10:47
1

@kamath It is similar to `%>%` in `dplyr`. We are updating on the resulting dataset from each operation. – akrun Nov 29 '15 at 10:48
2

You could do this in one tep using `ifelse` as in `DT2[DT1, P := ifelse(is.na(P), i.P0, P), on = 'ID']` – David Arenburg Nov 29 '15 at 10:55
@DavidArenburg Yes, thanks, but I am not sure `ifelse` and data.table can be mixed up. – akrun Nov 29 '15 at 10:57
@DavidArenburg Thanks, I updated. I was thinking the `ifelse` would be slow compared to assigning. – akrun Nov 29 '15 at 11:00
@DavidArenburg Very nice solution. How i.P0 works? Where is some info about this syntax? – mjaniec Jan 10 '19 at 12:31
@mjaniec You will usually use `i.` prefix when you have column with the same name in two data sets that you are joining. Or if you want to be explicit in order to make your code bullet proof. `i.` just tells data.table to take the column from the data set in the `i`th location (as in `DT[i, j, by]`), so if you do a join of type `DT[DT2]`, then `DT2` will be in the `i`th location of `DT`, and if you want to be explicit about taking columns from `DT2`, you will prefix their names with `i.`. In the newer versions there also `x.` prefix. I think you can check out data.tables "NEWS" section on GH. – David Arenburg Jan 10 '19 at 12:41
@DavidArenburg Thanks a lot for swift and detailed answer! So i. refers to DT[i, ...]. I understand, using it is useful to bullet proof the code. In this simple case ifelse(is.na(P), P0, P) seems to work. – mjaniec Jan 10 '19 at 12:47

R: Updating NAs in a data table with values of another data table

2 Answers2

Linked