5

I am learning data.table. I have difficulty converting the dplyr join syntax. Can you please recommend the data.table equivalence for the following test cases?

library(data.table)
library(dplyr)

dtProduct <- data.table(
    ProductID  = c(6, 33, 17, 88, 44, 51),
    ProductName= c("Shirt", "Helmet", "Gloves", "Towel", "Chair", "Detergent"),
    Price= c(25, 60, 10, 7.5, 135, 16),
    key = 'ProductID'
)

set.seed(20141216)
dtOrder <- data.table(
    OrderID    = sample(1001:9999, 12),
    CustomerID = sample(271:279, 12, replace=TRUE),
    # NOTE: some non-existent ProductID intentionally introduced
    ProductID  = sample(c(dtProduct[, ProductID], 155, 439), 12, replace=TRUE),
    Qty = sample(1:3, 12, replace=TRUE),
    key = 'OrderID'
)

> tables()
     NAME      NROW NCOL MB COLS                             KEY      
[1,] dtOrder     12    4  1 OrderID,CustomerID,ProductID,Qty OrderID  
[2,] dtProduct    6    3  1 ProductID,ProductName,Price      ProductID

> dtProduct
   ProductID ProductName Price
1:         6       Shirt  25.0
2:        17      Gloves  10.0
3:        33      Helmet  60.0
4:        44       Chair 135.0
5:        51   Detergent  16.0
6:        88       Towel   7.5
> dtOrder
    OrderID CustomerID ProductID Qty
 1:    1651        275         6   3
 2:    2726        272        88   2
 3:    3079        275        88   2
 4:    3168        274        17   1
 5:    4816        277        88   1
 6:    4931        278        51   1
 7:    5134        274       439   2
 8:    5265        272        33   3
 9:    7702        275        33   2
10:    7727        279       155   2
11:    8412        273        88   2
12:    9130        271        17   3

Case1: Show Order Details, no-match ProductID are hidden

dtOrder %>%
    inner_join(dtProduct, by="ProductID") %>%
    transmute(OrderID, ProductID, ProductName, Qty, Price, ExtPrice=Qty*Price)

   OrderID ProductID ProductName Qty Price ExtPrice
1     1651         6       Shirt   3  25.0     75.0
2     3168        17      Gloves   1  10.0     10.0
3     9130        17      Gloves   3  10.0     30.0
4     5265        33      Helmet   3  60.0    180.0
5     7702        33      Helmet   2  60.0    120.0
6     4931        51   Detergent   1  16.0     16.0
7     2726        88       Towel   2   7.5     15.0
8     3079        88       Towel   2   7.5     15.0
9     4816        88       Towel   1   7.5      7.5
10    8412        88       Towel   2   7.5     15.0

Case2: Show Order Details, INCLUDING no-match ProductID

dtOrder %>%
    left_join(dtProduct, by="ProductID") %>%
    transmute(OrderID, ProductID, ProductName, Qty, Price, ExtPrice=Qty*Price)

   OrderID ProductID ProductName Qty Price ExtPrice
1     1651         6       Shirt   3  25.0     75.0
2     3168        17      Gloves   1  10.0     10.0
3     9130        17      Gloves   3  10.0     30.0
4     5265        33      Helmet   3  60.0    180.0
5     7702        33      Helmet   2  60.0    120.0
6     4931        51   Detergent   1  16.0     16.0
7     2726        88       Towel   2   7.5     15.0
8     3079        88       Towel   2   7.5     15.0
9     4816        88       Towel   1   7.5      7.5
10    8412        88       Towel   2   7.5     15.0
11    7727       155          NA   2    NA       NA
12    5134       439          NA   2    NA       NA

Case3: Show Order Errors (Only no-match ProductID)

dtOrder %>%
    left_join(dtProduct, by="ProductID") %>%
    filter(is.na(ProductName)) %>%
    select(OrderID, ProductID, ProductName, Qty)

  OrderID ProductID ProductName Qty
1    7727       155          NA   2
2    5134       439          NA   2

Case4: Various Aggregates by ProductID, sort result by TotalSales descending

dtOrder %>%
    inner_join(dtProduct, by="ProductID") %>%
    group_by(ProductID) %>%
    summarize(OrderCount=n(), TotalQty=sum(Qty), TotalSales=sum(Qty*Price)) %>%
    arrange(desc(TotalSales))

  ProductID OrderCount TotalQty TotalSales
1        33          2        5      300.0
2         6          1        3       75.0
3        88          4        7       52.5
4        17          2        4       40.0
5        51          1        1       16.0


Case5: Various Aggregates by ProductID, sort result by TotalSales descending

  • NOTE1: This time, ProductName is displayed along with ProductID
  • NOTE2: sort by descending TotalSales no longer working (BUG?)

    dtOrder %>%
       inner_join(dtProduct, by="ProductID") %>%
       group_by(ProductID, ProductName) %>%
       summarize(OrderCount=n(), TotalQty=sum(Qty), TotalSales=sum(Qty*Price)) %>%
       arrange(desc(TotalSales))
    
      ProductID ProductName OrderCount TotalQty TotalSales
    1         6       Shirt          1        3       75.0
    2        17      Gloves          2        4       40.0
    3        33      Helmet          2        5      300.0
    4        51   Detergent          1        1       16.0
    5        88       Towel          4        7       52.5
    

Thank you very much in advance for any help.

Polymerase
  • 6,311
  • 11
  • 47
  • 65
  • Hi David, thanks for the edit. I don't know why the code for Case5 cannot be rendered correctly. How could you fix it? Also, I added "Hi datatable experts" in the top, but somehow this text is never accepted. Is there a max length in the post? – Polymerase Dec 17 '14 at 22:32
  • Hey, you just need to add enough spaces :). I don't know why your edit wasn't accepted, maybe because we were editing at the same time? Either way, there is no real need to add this sentence to the question IMO. – David Arenburg Dec 17 '14 at 22:55
  • 5
    @Polymerase Weel you want to learn data.table , but I don't see any data.table code here ( except the one which create the table), What have you so far tried? – agstudy Dec 17 '14 at 23:04
  • @agstudy, I don't understand such a harsh comment. The test cases I presented here were carefully prepared. If you really read my post, you would see that each test case came with a WORKING code using dplyr which is pretty well written. The dplyr query is made with an underlying datatable object. I am wondering if a "native" datatable syntax would be better. In case you have an alternative solution, other than mine or eddi's below. Would you be kind enough to share your experience? – Polymerase Dec 18 '14 at 16:11
  • 4
    @Polymerase, agtudy's comment is not really harsh. Per SO standards this question can be considered as "no research effort" - you show code from one tool/package, and ask for code of another, which could be considered off-topic. Your intentions are honest, I understand, but so are his. – Arun Dec 18 '14 at 19:32

2 Answers2

10

You should look at ?data.table and go through the examples there. It's a very nice way of learning. We're in the process of writing more detailed vignettes FR #944, scheduled for 1.9.8. But until then:


- data.table form

data.table's syntax is of the form:

x[i, j, by, ...] # i = where, j = select|modify|update, by = group by

- subset operations

When i is an integer or logical expression, we call it a subset operation. For example:

x[a > 1]

What does this do? The column a from the data.table x is checked for the condition > 1, which results in a logical vector = length(a). And those rows where the condition evaluates to TRUE are identified, and all the columns corresponding to those rows are returned.

- joins as extension of subsets

concept

In data.table, joins can be seen as a natural extension of subsets. That is, we can think of a join as a subset operation, but using another data.table. This is what we mean by having a consistent syntax -- the form x[i, j, by] is intact.

The first step towards joining in data.tables is setting keys. This can be accomplished using the setkey() function whose purpose is two-fold:

  • reorder the rows of the data.table in increasing order (ascending) by the columns provided. This is done by reference to be memory efficient.

  • mark those columns provided as key columns on which a joins can be performed (if and when you perform a join).

Note that currently, for a join of the form x[i], x needs to have key column(s) set absolutely. i may or may not have it's key set.

  • If i also has it's key column set, then joins are performed by matching the first key column of i with the first key column of x, second with second, etc..

  • If i doesn't have key columns set, then the first column of i is matched to first key column of x, second column of i with second key column of x and so on..

Yes we are aware that it'd be nice to match by column names when i doesn't have key columns but we just didn't have the time to get to it yet.

The second and final step is to perform the join :-).

But how is a join operation an extension of subset? When i is a data.table, for each row in i, it finds the matching row indices in x by matching on x's key columns that we've set. This returns a set of row indices of x for each row in i (or NA if no match is found).

Now we have the matching row indices. All we have to return are the columns. But since i is also a data.table, it might have additional columns as well. So, we return the columns of both x and i for those matching row indices.

example

Here's a small example to help you internalise the concept before we move on. Consider the two data.tables X and Y as shown below:

X = data.table(a=c(1,1,1,2,2,5,6), b=1:7, key="a")
#    a b
# 1: 1 1
# 2: 1 2
# 3: 1 3
# 4: 2 4
# 5: 2 5
# 6: 5 6
# 7: 6 7

key(X)
# [1] "a"

Y = data.table(a=c(6,2), c=letters[1:2])
#    a c
# 1: 6 a
# 2: 2 b

key(Y)
# NULL

# join
X[Y]
#    a b c
# 1: 6 7 a
# 2: 2 4 b
# 3: 2 5 b

Note that we have used the key= argument in data.table() function to set the key columns directly. Alternatively we could have just created X without keys and then setkey(X, a).

The function key() returns the key columns if any. If no key is set, it returns NULL.

Y doesn't have key columns, and X has only one key column. So join is done using first column a of Y and first key column a of X. a=6 in Y matches with row 7 of X and a=2 on rows 4 and 5.

You can check this by using the argument which = TRUE:

X[as.data.table(6), which=TRUE] # [1] 7
X[as.data.table(2), which=TRUE] # [1] 4 5

This is also a handy (and fast) way to subset a data.table, but using data.table's fast binary search based subset. Since this operation is quite useful, data.table provides an easy way of doing this instead of having to write as.data.table() each time.

# faster way of doing X[a == 6] on data.table with 'a' as key column
X[J(6)] # J for Join
X[J(2)]

# (or)

X[.(6)] # . is an alias for J
X[.(2)]

I think this should further help towards understanding what we mean by subsets are extensions of joins.


back to your question

Now, let's forget for a moment about all these "left", "right", "inner", "outer" etc.. and look at the actual operation you want to perform. You've two data.tables - dtP and dtO (shortened for convenience).

case 1:

For each row of column ProductID in dtO, you want to find the matching rows in dtP, but you don't want to return NA. And you want to do also select the columns you want to output, along with some calculations.

That is, i = dtO and x = dtP. Key column for dtP is set correctly. But the key column for dtO is orderID. If we joine as such, it'll be joining orderID from dtO against productID from dtP, which is wrong.

Either we have to set key of dtO to productID or set key of dtO to NULL and move the column productID as the first column (until matching by names is implemented). Let's set key to productID here:

# set key
setkey(dtO, ProductID)
# join
dtP[dtO, .(OrderID, ProductID, ProductName, Qty, Price, ExtPrice=Qty*Price), nomatch=0L]

It should be quite obvious what this does now. On only matching row indices, extract all these columns (including expression).

Why should we join first and select/aggregate after?

case 2:

Same as case 1, but you need even non matching rows. Key is already set properly from case 1.

# join
dtP[dtO, .(OrderID, ProductID, ProductName, Qty, Price, ExtPrice=Qty*Price)]

Return all rows of orderID even if there's no match, and all the columns specified (including expression).

case 3:

You want all the rows in dtO that has no match with dtP.

not-join or anti-join
dtO[!dtP]

Find all rows where dtP's key column matches with dtO. Return all other rows from dtO. If necessary, you can specify all necessary columns in j as well.

case 4:

Read about by=.EACHI from this post.

You are joining by productID and then aggregating by the same column. But why do we need that intermediate result? It's totally unnecessary and waste of memory and computational time! Instead we can make use of by=.EACHI which will evaluate the j-expression on matching rows for each row in i.

dtO[dtP, .(.N, sQty = sum(Qty), sSales = sum(Qty*Price)), by=.EACHI, nomatch=0L][order(-sSales)]

To test your understanding, try to find out why we did not do dtP[dtO, ...] here..

case 5:

Identical to @eddi's.


I personally find it more natural to think in terms of the actual task I want to perform instead of figuring out the type of join function associated with the task I want to perform (I can never remember which data.table is "left" and which one is "right"... and btw, what the heck is "inner", "outer" and "full outer" join anyways?).

HTH

Community
  • 1
  • 1
Arun
  • 116,683
  • 26
  • 284
  • 387
  • (part 1/2) Hi Arun, I am totally honoured to get answered by the data.table author himself. I wish ?data.table includes your above explanation of JOIN and by=.EACHI in the link you mentioned. I come from a world where we must set explicitly the join key. Expecting to find the same concept in ?data.table, I was confused by J, SJ, CJ operators and definitely could not understand that X[Y] does mean join. Agreed that "left", "right", "inner", "outer" etc... is irrelevant here, those are just "relational" qualifiers. But they do symbolize a precise concept of what we want to get out of a join. – Polymerase Dec 18 '14 at 17:14
  • (part 2/2) in data.table, these concepts are achieved with a different syntax. With the same opposite side-effects (nomatch has no meaning to a "traditional" DB person). Perfectly understood for by=.EACHI, thank you for the detailed explanation. You can be assured that I will be a good student. Every word you put here was carefully absorbed. – Polymerase Dec 18 '14 at 17:22
  • Is there anyway to mark this post as answer too? Without removing the answer status from eddi's earlier post? – Polymerase Dec 18 '14 at 17:27
  • No worries. Glad it helped. I'm a co-developer. Matt is the main author. You can't have more than one answer to a Q. It's fine to retain Eddi's as the answer. – Arun Dec 18 '14 at 19:44
  • On joins, I meant the names. How does "inner" relate to the result we obtain, and "outer"? And on data.table, the point was to illustrate that the syntax is consistent for both subsets/joins, as they are not totally different concepts (having a separate function for each of the joins provides an abstraction that they are very much unrelated, IMHO). – Arun Dec 18 '14 at 19:48
  • (part 1/2) Arun, the generic scenarios you want to obtain from joins are: 1: only the common keys (inner), 2:only the NOT common key of the left/right collection (outer with WHERE clause), 3: both common and not common (outer). Regardless of the vocabulary, the "engine" must allow these nuances of queries. Datatable uses nomatch parameter as equivalent to SQL outer and the DT placement as left/right X[Y] vs Y[X]. Strictly speaking, subsetting is extracting a small collection from a bigger collection. While datatable join X[Y] is actually an intersection of two independant collections. – Polymerase Dec 18 '14 at 20:08
  • (part 2/2). Let imagine X[Y] where both X and Y datatables have a lot of common keys and and also duplicated keys (several rows having same key value). The result X[Y] is a collection which is actually bigger than X or Y alone. If we nickpick on the vocabulary, how can this be called a subset? Behind the scene, I suppose the data.table engine does a kind of intersection. In similar fashion than dplyr or SQL query engine. Whether we call that intersect result a subset or join, it is just a terminology convention. The various types of joins, it's just a kind of syntax to mean nomatch=0 or 1. – Polymerase Dec 18 '14 at 20:21
  • @Polymerase, I did say *joins can be seen as a natural extension of subsets* and that they're *related* = not necessarily identical. I perfectly understand the distinction/difference between *subset* and *join*. But my point was to insist on the relation between them (which, again, IMHO separate functions with unrelated names seem to abstract as a different concept). In any case, I'm totally fine to agree to disagree. – Arun Dec 18 '14 at 20:26
9
setkey(dtOrder, ProductID)

(1-2)

# this will be literally what you wrote

dtProduct[dtOrder,
          list(OrderID, ProductID, ProductName, Qty, Price, ExtPrice=Qty*Price),
          nomatch = 0 # or omit this to get (2)
         ]

# but I think you'd be better off with this
dtProduct[dtOrder][, ExtPrice := Qty*Price][]

(3)

# you can again take the literal direction:
dtProduct[dtOrder][!is.na(ProductName)][,
          list(OrderID, ProductID, ProductName, Qty)]

# but again I think you'd be better off with
dtOrder[!dtProduct]

(4-5)

dtProduct[dtOrder, nomatch = 0][,
          list(OrderCount=.N, TotalQty=sum(Qty), TotalSales=sum(Qty*Price)),
          by = list(ProductID, ProductName)][
          order(-TotalSales)]
eddi
  • 49,088
  • 6
  • 104
  • 155
  • eddi, OMG this is absolutely amazing. It is worth indeed to look into datatable. I am totally impressed by this DT syntax. A few questions though: Q1. What is the purpose of setkey(dtOrder, ProductID) both the DT already had key defined. Q2. What if the two DTs are joined on more than 1 column, same syntax (i.e automatic detection via column name?). Q3. What are the SJ() and J() operators? You could solve all the cases without using these, how come? – Polymerase Dec 18 '14 at 00:31
  • @Polymerase Q1: you need to set the key correctly to have a correct match (your key was set to a different column); Q2: if you set the keys for both tables, everything should go through as-is; Q3: take a look at `?J` and the examples there, you can also compare `dtProduct[J(dtOrder$ProductID)]` vs `dtProduct[dtOrder]` – eddi Dec 18 '14 at 18:32