data.table cartesian join warning on legitimate join

Question

I am sorry in advance for the long post which mixes several questions. If not appropriate, please edit or advise what I should do. I am practicing data.table join, here is an imaginary scenario:

"Two robots, each has 4 positions in MovementA and 4 positions in MovementB. Problem to Solve: for each robot, from MoveA to MoveB, there are 4x4 possible Position-pairs Find the 4 pairs with the shortest distance"

Data Setup

library(data.table)

set.seed(20141220)
dtMoveA = data.table(RobotID = rep(1:2, each=4), Position=sample(1:20, 8))
dtMoveB = data.table(RobotID = rep(1:2, each=4), Position=sample(1:20, 8))

# Review Data
rbind(cbind(Movement="Move-A", dtMoveA), cbind(Movement="Move-B", dtMoveB))

    Movement RobotID Position
 1:   Move-A       1       18
 2:   Move-A       1       20
 3:   Move-A       1       15
 4:   Move-A       1        8
 5:   Move-A       2       13
 6:   Move-A       2        2
 7:   Move-A       2        9
 8:   Move-A       2       12
 9:   Move-B       1       18
10:   Move-B       1       14
11:   Move-B       1       13
12:   Move-B       1       17
13:   Move-B       2        5
14:   Move-B       2       16
15:   Move-B       2       20
16:   Move-B       2        3

Solution 1 (using dplyr)

library(dplyr)

dtMoveA %>%
    inner_join(dtMoveB, by="RobotID") %>%
    mutate(AbsDistance = abs(Position.x - Position.y)) %>%
    group_by(RobotID, Position.x) %>%
    filter(AbsDistance == min(AbsDistance)) %>%
    arrange(RobotID, Position.x)

  RobotID Position.x Position.y AbsDistance
1       1          8         13           5
2       1         15         14           1
3       1         18         18           0
4       1         20         18           2
5       2          2          3           1
6       2          9          5           4
7       2         12         16           4
8       2         13         16           3

(attempt) Solution 2 (using data.table)

setkey(dtMoveA, RobotID)
setkey(dtMoveB, RobotID)

dtMoveA[dtMoveB, .(RobotID, Position, i.Position,
                         AbsDistance = abs(Position - i.Position)), allow.cartesian=TRUE
    ] [, MinDistance := min(AbsDistance), by=list(RobotID, Position)
    ] [ AbsDistance == MinDistance, .(Position, i.Position, AbsDistance), by=RobotID
    ] [ order(RobotID, Position)]

   RobotID Position i.Position AbsDistance
1:       1        8         13           5
2:       1       15         14           1
3:       1       18         18           0
4:       1       20         18           2
5:       2        2          3           1
6:       2        9          5           4
7:       2       12         16           4
8:       2       13         16           3

Question 1 Can you please correct my Solution2 with the good practices from data.table art?

Question 2 without the parameter allow.cartesian=TRUE data.table warns "Join results in 32 rows; more than 8 = max(nrow(x),nrow(i)). Check for duplicate key values in i, each of which join to the same group in x over and over again. If that's ok, try including j and dropping by (by-without-by) so that j runs for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE"

Is it really a cartesian product? Here the join is made only on the common key values, it is just a coincidence in the data which yields a big join results.

Question 3 dtMoveA and dtMoveB have same column names. datatable join makes the distinction by changing the name to i.Position. Is the "i" prefix something hardcoded? And I suppose i.ColumnName always applies to Y member in the X[Y] join expression.

Thanks in advance for any help.

You should really start getting into the habit of asking just one question per post. If I'm not mistaken, one of your other posts this week contained about 5 different questions — Rich Scriven, Dec 21 '14 at 06:47
These Qs are somewhat related to the same test setup. If i make 3 different posts. I am afraid I need to repeat the context each time. However, I absolutely don't want to ruffle anyone the wrong way. Should I remove Q2, Q3 and make two new posts? — Polymerase, Dec 21 '14 at 06:53
Richard, let's focus on this post for now. I am enough beaten on the other post. Please forgive me for my past novice mistake and allow me a chance continue learning. — Polymerase, Dec 21 '14 at 07:03
@Polymerase, not everything is B&W on SO. Here I find the questions quite related for a single post. But of course others might disagree. My advice to you would be to not take comments/suggestions personally. Good luck! — Arun, Dec 21 '14 at 09:07
I apologize if I came off harsh. That was not my intent. I'm a bit of a stickler for the guidelines and had good intentions when I wrote that comment. — Rich Scriven, Dec 22 '14 at 01:16

score 5 · Accepted Answer · edited Jun 20 '20 at 09:12

5

Question 1:

Looks great! But the equivalent of inner join is to also add nomatch=0L. Otherwise you'd also get all the rows from dtMoveB. We can't make use of by=.EACHI here AFAICT.

Read this answer and the answer linked under this comment for understanding the purpose of allow.cartesian = TRUE.

Question 2: From ?data.table under the entry for allow.cartesian:

FALSE prevents joins that would result in more than max(nrow(x), nrow(i)) rows. This is usually caused by duplicate values in i's join columns, each of which join to the same group in x over and over again: a misspecified join.

Usually this was not intended and the join needs to be changed. The word 'cartesian' is used loosely in this context. The traditional cartesian join is (deliberately) difficult to achieve in data.table: where every row in i joins to every row in x (a nrow(x) * nrow(i) row result). 'cartesian' is just meant in a 'large multiplicative' sense.

Does this answer your question?

Question 3:

Yes. Joins are of the form x[i]. When x and i both share a column name and will be in the joined result, those columns have a i. prefix added to it. It's the same prefix that also allows you to access i's columns in j, where both x and i share the column name in operations of the form x[i, j] or x[i, j, by=.EACHI].

While joining you can change the names to whatever you like though. Here, you could change to position.x and position.y with:

dtMoveA[dtMoveB, .(RobotID, Position.x=Position, Position.y=i.Position,
                 AbsDistance = abs(Position - i.Position)), allow.cartesian=TRUE]

HTH

PS: If you've any suggestions, feel free to add a FR here. Please have a look at the posting guidelines before to do so.

edited Jun 20 '20 at 09:12

Community

1
1

answered Dec 21 '14 at 09:04

Arun

116,683
26
284
387

Hi Arun, Your answers are always perfect, thanks again. I am really surprised that I got the DT query OK. As for allow.cartesian, I suspected that it was a precaution to prevent unintentional runaway combinatory explosion. But what threw me off course was the word 'cartesian' b/c I read it litterally. I feel better now, thanks. I thought I made an join syntax error somewhere which triggered a cross join I didn't intend to do. The 'i.' prefix is fine, no rename needed, as long as I know the rule is consistent, I can comply to the default. FR?!? Please give me some years. Still DT baby here. – Polymerase Dec 21 '14 at 11:33
:-) sure. Just thought of something. The part `[AbsDistance == MinDistance, .(Position, i.Position, AbsDistance), by=RobotID]` doesn't need grouping. You just need to get the columns.. You can do: `[AbsDistance == MinDistance, .SD, .SDcols=c("Position", "i.Position", "AbsDistance")]`. – Arun Dec 21 '14 at 19:21
Thanks! that := operator to add a column on a GroupBy is really neat. There is no equivalence in SQL or dplyr. Following your advice the last query could simply be `[AbsDistance == MinDistance]` or `[AbsDistance == MinDistance, .SD]` which also outputs RobotID as info but also includes the redundant MinDistance column. I'm going to try by myself to find out if there is a way to exclude a column, like dplyr's select(dtResult, -MinDistance). – Polymerase Dec 22 '14 at 16:09
Exactly. That's why I had to add `.SDcols`. Once again note that using `.SDcols` is slightly more efficient as you don't subset the column you require (only to delete later). You could do: `[AbsDistance == MinDistance, .SD, .SDcols=-c("RobotID")]`... – Arun Dec 22 '14 at 16:16
Oh this is Greeeaaat, .SDcols = -c("Col1", "Col2") was what I was looking for. This can also be used for excluding columns! I hope the data.table built-in help will be updated to give some examples of that use case. Don't know what other gems still hidden in DataTable but really, there is enough materials to write a book on this topic alone. Hope that you and Matt will write that book one day. – Polymerase Dec 22 '14 at 17:08
Ha, thanks for the encouragement. We're now in the process of writing some detailed vignettes. Hopefully we'll be able to get that done first :-). We are (mostly) just two people working on our free time. There are lot of things on the to-do list :-). – Arun Dec 22 '14 at 17:21

data.table cartesian join warning on legitimate join

1 Answers1