
With the data set below, how do I write a data.table call that subsets this table and returns all customer IDs and the associated orders for those customers, if that customer has ever purchased SKU 1?

The expected result is a table that excludes cid 3 and 5 (neither ever purchased SKU 1) and keeps every row for the customers who have a sku == 1 purchase.

I am getting stuck because I don't know how to write a "contains" statement: a literal sku == 1 returns only the rows matching that condition, not the customer's other orders. I am sure there is a better way.

library("data.table")    
df<-data.frame(cid=c(1,1,1,1,1,2,2,2,2,2,3,4,5,5,6,6),
    order=c(1,1,1,2,3,4,4,4,5,5,6,7,8,8,9,9),
    sku=c(1,2,3,2,3,1,2,3,1,3,2,1,2,3,1,2))

    dt=as.data.table(df)
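
For reference, this is the literal filter I mean; it returns only the matching rows, not each customer's full order history:

dt[sku == 1]
#    cid order sku
# 1:   1     1   1
# 2:   2     4   1
# 3:   2     5   1
# 4:   4     7   1
# 5:   6     9   1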

2 Answers


This is similar to a previous answer, but here the subsetting works in a more data.table-like manner.

First, let's take the cids that meet our condition:

matching_cids = dt[sku==1, cid]

The %in% operator allows us to filter to just those rows whose cid is contained in that vector. So, using the above:

dt[cid %in% matching_cids]

or on one line:

> dt[cid %in% dt[sku==1, cid]]
     cid order sku
  1:   1     1   1
  2:   1     1   2
  3:   1     1   3
  4:   1     2   2
  5:   1     3   3
  6:   2     4   1
  7:   2     4   2
  8:   2     4   3
  9:   2     5   1
 10:   2     5   3
 11:   4     7   1
 12:   6     9   1
 13:   6     9   2
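
As a side note, later versions of data.table (1.9.6+) support ad hoc joins via the on= argument, so roughly the same thing can be written as a join without %in% (a sketch, assuming a reasonably current data.table):

# join dt against the unique set of qualifying customer ids
dt[unique(dt[sku == 1, .(cid)]), on = "cid"]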
– Peter Fine

I would have thought that the more data.table way of doing this is to use keys. I couldn't quite work out how to put the whole lot on a single line, but I think this would be a bit quicker on large data because, as I understand it (and I may very well be mistaken), this is the only solution presented thus far that avoids vector scanning, which is slow compared to binary search:

# Set initial key
setkey(dt, sku)

# Select only rows with sku == 1 and return the first example for each
# customer, keying the result by customer id
dts <- dt[J(1), .SD[1], keyby = cid]

# Change the key of dt to cid to match the customer id
setkey(dt, cid)

# Join based on the common key
dt[dts, .SD]
#    cid order sku
# 1:   1     1   1
# 2:   1     1   2
# 3:   1     2   2
# 4:   1     1   3
# 5:   1     3   3
# 6:   2     4   1
# 7:   2     5   1
# 8:   2     4   2
# 9:   2     4   3
#10:   2     5   3
#11:   4     7   1
#12:   6     9   1
#13:   6     9   2

An alternative that you can do on one line is to use a data.table merge, like so:

setkey(dt, sku)
merge(dt[J(1), .SD[1], keyby = cid], dt, by = "cid")
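
To sanity-check the speed claim, here is a rough benchmark sketch on simulated data (the sizes and timings are illustrative only):

set.seed(1)
n   <- 1e7
big <- data.table(cid   = sample(1e5, n, replace = TRUE),
                  order = sample(1e6, n, replace = TRUE),
                  sku   = sample(50,  n, replace = TRUE))

# vector-scan approach from the other answer
system.time(res1 <- big[cid %in% big[sku == 1, cid]])

# keyed binary-search approach
system.time({
  setkey(big, sku)
  cids <- big[J(1), unique(cid)]
  setkey(big, cid)
  res2 <- big[J(cids)]
})

# the two results hold the same rows, though res2 comes back sorted by cid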
– Simon O'Hanlon
    Not sure what the -1 is about for this answer, but one thing I'd point out is that I'd be quite surprised if this was faster than the vector scan in the other answer, for two reasons: first, joining is only faster if the data is already keyed or if the vector scan involves a lot of evaluations; and second, using `.SD` in this manner (where it results in a call to `[.data.table` inside `[.data.table`) is very slow at the moment – eddi Nov 20 '13 at 15:55