10

I have seen a similar post on this topic, but could not comprehend the solution. Essentially I have a data table (DT1) with scores and some NAs. Wherever there is an 'NA' I would like a process that references another data (DT2) table that has a common column in both tables (tract) and then update the NA in DT1 with the value of DT2's score column. I am choosing a data table class because I think it is a an efficient option.

DT1

         tract CreditScore    
1: 36107020401         635  
2: 36083052403          NA  
3: 36091062602          NA  
4: 36067013000          NA  
5: 36083052304          NA  

DT2

         tract CreditScore  
1: 36107020401         635  
2: 36083052403         650  
3: 36091062602         335  
4: 36067013000         777  
5: 36083052304         663  
Arun
  • 116,683
  • 26
  • 284
  • 387
user3067851
  • 524
  • 1
  • 6
  • 20
  • If you're using the `data.table` package, you need to make this clear in your question. It's unclear to me if you mean a data structure specific to `data.table` or the base structure `data.frame`. Is there a reason (such as size / memory concerns) you can't just merge the two tables and use `ifelse()` to create a new column which has the appropriate value? – alexwhitworth Mar 05 '15 at 22:35
  • 1
    @Alex, yes there is. He could as well just load the data into Excel and a do a `vlookup`, don't he? More specifically, `data.table` can perform a *binary* join and update columns *by reference while joining*. This would be more efficient than `merge` + `ifelse` by several factors on a big data set. – David Arenburg Mar 05 '15 at 22:51
  • Thank you for your comments. Yes, the main reason for me choosing to use the data.table package is its overall efficiency. I didn't want to add another column to DT1 if I didn't need to. I will try the answer below and see how it works.....thanks and cheers! – user3067851 Mar 06 '15 at 04:31

2 Answers2

13

We've created new (and more comprehensive) HTML vignettes for some of the data.table concepts. Have a look here for the other vignettes that we are working on. I'm working on vignettes for joins, which when done will hopefully clarify these type of problems better.


The idea is to first setkey() on DT1 on the column tract.

setkey(DT1, tract)

In data.tables, a join of the form x[i] requires key for x, but not necessarily for i. This results in two scenarios:

  • If i also has key set -- the first key column of i is matched against first key column of x, second against second and so on..

  • If i doesn't have key set -- the first column of i is matched against the first key column of x, second column of i against second key column of x and so on..

In this case, since your first column in i is also tract, we'll skip setting key on i.

Then, we perform a join of the form x[i]. By doing this, for each i the matching row indices in x is computed, and then the join result is materialised. However, we don't want the entire join result as a new data.table. Rather, we want to update DT1's CreditScore column with DT2's on those matching rows..

In data.tables, we can perform that operation while joining, by providing the expression in j, as follows:

DT1[DT2, CreditScore := i.CreditScore]
#          tract CreditScore
# 1: 36067013000         777
# 2: 36083052304         663
# 3: 36083052403         650
# 4: 36091062602         335
# 5: 36107020401         635

DT1[DT2 part finds the matching rows in DT1 for each row in DT2. And if there's a match, we want DT2's value to be updated in DT1. We accomplish that by using i.CreditScore -- it refers to DT2's CreditScore column (i. is a prefix used to distinguish columns with identical names between x and i data.tables).


Update: As pointed out under comments, the solution above would also update the non-NA values in DT1. Therefore the way to do it would be:

DT1[is.na(CreditScore), CreditScore := DT2[.(.SD), CreditScore]]

On those rows where CreditScore from DT1 is NA, replace CreditScore from DT1 with the values from CreditScore obtained from the join of DT2[.(.SD)], where .SD corresponds to the subset of data.table that contains all the rows where CreditScore is NA.

HTH

Arun
  • 116,683
  • 26
  • 284
  • 387
  • Am I right that this uses `CreditScore` from `DT2` for every matching `tract`? As I understand the OP, the `DT2`value should only be taken if it is `NA` in `DT1`. In the example given, it does not matter since it is 635 in both tables for the common record. If they were different, how could the `DT1` value be retained? – DaveTurek Mar 06 '15 at 17:22
  • 1
    I'm being told `.SD` is locked when trying to implement this solution. – MichaelChirico May 19 '15 at 22:22
  • @MichaelChirico yeah, I'm seeing the same behavior and can't understand when it is locked and when it isn't. For instance, it is isn't locked [here](http://stackoverflow.com/a/29659895/3001626) but it is locked [here](http://stackoverflow.com/questions/38910869/join-on-multiple-keys-and-conditional-update#comment65183841_38910869). Dunno... – David Arenburg Aug 12 '16 at 08:30
  • @DavidArenburg hmm, smells like a bug to me – MichaelChirico Aug 12 '16 at 19:56
  • I believe that means you just need to set the key in `DT2` also – dmp Apr 07 '17 at 17:14
0

A dplyr alternative is rows_patch:

library(dplyr)
rows_patch(DT1, DT2)
Maël
  • 45,206
  • 3
  • 29
  • 67