0

I'm trying to left join between two data.tables using a condition (income>cost for example). I need to this in data.table (and not say dplyr) due to efficiency as I'm working huge datasets. My problem is, when I join using the following code, my "income" column gets overwritten with the values of the cost column.

Any suggestion to fix this behavior?

A <- data.table(ID=c(1,2,3,4),cost=c(1000,2000,3000,4000))
B <- data.table(income=c(1500,2500,3500,4500),name=c("A","B","C","D"))

#The output:
B[A,.(ID,cost,income,name),on=.(income>cost)
  ,allow.cartesian=TRUE
  ,nomatch=NULL]

#Expected output:
A <- A %>% mutate(ID_j=1)
B <- B %>% mutate(ID_j=1)

A %>% 
  left_join(B) %>% 
  as.data.frame() %>% 
  filter(cost<income) %>% 
  select(-ID_j)

Used different libraries, but I need to make it work with data.table.

Merijn van Tilborg
  • 5,452
  • 1
  • 7
  • 22
Fobos
  • 3
  • 2
  • 1
    so just to be clear, the join is just literally all rows in the current row order? – Paul Stafford Allen Feb 10 '23 at 11:10
  • 1
    We have 4 rows in one column and 4 in the other, joining all of A to all of B will give a data frame of 16 rows (4x4). I then filter to only have rows where the income in B is bigger than the cost in A, so the result is 10 rows. If I do this using data.table, the result is 10 rows, but the column gets overwritten and columns cost and income are exactly the same. – Fobos Feb 10 '23 at 11:18

2 Answers2

0

This will do the cross-join aspect (get your fill 16 rows) using a temporary key (k) then filter and drop the k column:

res<-setkey(A[,c(k=1,.SD)],k)[B[,c(k=1,.SD)],allow.cartesian=TRUE][income>cost,-c("k")]

adapted from: How to do cross join in R?

Paul Stafford Allen
  • 1,840
  • 1
  • 5
  • 16
  • Is this efficient? This seems like just joining every row with every row, not using inequality to not give matches, and then filtering right? – Fobos Feb 10 '23 at 11:47
  • No clue - definitely worth testing the timings with a larger dataset. It may be possible to incorporate the comparison/filtering at the point of the intial join. – Paul Stafford Allen Feb 10 '23 at 12:42
0
B[A, .(ID, cost, income = x.income, name), on = .(income > cost)]

results

    ID cost income name
 1:  1 1000   1500    A
 2:  1 1000   2500    B
 3:  1 1000   3500    C
 4:  1 1000   4500    D
 5:  2 2000   2500    B
 6:  2 2000   3500    C
 7:  2 2000   4500    D
 8:  3 3000   3500    C
 9:  3 3000   4500    D
10:  4 4000   4500    D
Merijn van Tilborg
  • 5,452
  • 1
  • 7
  • 22