3

I was watching a video[1] by Greg Reda about Pandas to see what Pandas can do how it compares with data.table. I was surprised to learn how difficult it was to join tables in data.table. If you watch the video, specifically @49:00 to @52:00 minutes you see that Pandas allows you to join tables based on different column names and you can choose different suffixes for left and right tables. I understand that setkey is used for optimizaion purposes[2] and understand how to join tables using same column names[3]. I tried data.table's merge but had much difficulty setting the by= keyword parameter using different column names. So here are my questions.

Is it possible, in data.table, to join tables based on different column names? If so, how? If not, why not? Also, more usefully, wouldn't this feature be useful? I find it surprising that this issue hasn't come up earlier. Pardon me (and please point me to them) if this has been discussed earlier.

BTW, the data that Greg is talking about is found on his github[4].

  1. https://www.youtube.com/watch?v=1uVWjdAbgBg
  2. https://stackoverflow.com/a/13686768/3892933
  3. Joining tables with identical (non-keyed) column names in R data.table
  4. https://github.com/gjreda/pydata2014nyc
Community
  • 1
  • 1
Vijay
  • 151
  • 1
  • 11
  • Vijay, both features are implemented in 1.9.5. Please let us know on the github page if you've other suggestions for improvement. Thanks. – Arun Aug 13 '15 at 20:41
  • Thanks, Arun. Now, data.table behaves more like regular data.frame. Now I can use data.table even more often! – Vijay Aug 19 '15 at 05:35
  • 1
    Possible duplicate of [merging tables with different column names](http://stackoverflow.com/questions/16047253/merging-tables-with-different-column-names) – rafa.pereira Oct 22 '15 at 13:44

3 Answers3

6

Update: All the features listed below are implemented and is available in the current stable version of data.table v1.9.6 on CRAN.


There are at least these improvements possible for joins in data.tables.

  • merge.data.table gaining by.x and by.y arguments

  • Using secondary keys to join using both forms discussed above without need to set keys, but rather by specifying columns on x and i.

The simplest reason is that we've not managed to get to it yet.

Arun
  • 116,683
  • 26
  • 284
  • 387
  • Thanks @Arun. I like `by.x` and `by.y` arguments. I'm not at all complaining. I really like data.table and want it to succeed. I'm learning internals/code of data.table and I'll try to contribute as soon as I'm capable of it. – Vijay Dec 23 '14 at 14:19
  • @Vijay, no worries. Didn't think you were. Glad that you're learning the internals. You can email me (and/or Matt) if you have questions/difficulties (see github project page). Good luck :-). – Arun Dec 23 '14 at 14:40
2

Normally, when joining data.tables the column names don't actually matter. You just need to make sure both tables have a compatible key.

library(data.table)
dt1<-data.table(a=letters[1:10], b=1:10)
setkey(dt1,a)
dt2<-data.table(x=letters[1:10], y=10:1)
setkey(dt2,x)

dt1[dt2]

Basically it will join on all the key columns. Normally you are joining on a key. If you really need to specify non-key columns, you can always cast the data.table to a data.frame and use the standard merge() function

merge(as.data.frame(dt1),dt2, by.x="a", by.y="x")
merge(as.data.frame(dt1),dt2, by.x="b", by.y="y")
MrFlick
  • 195,160
  • 17
  • 277
  • 295
  • Your advice is sound, MrFlick and I'm aware that the keys don't have to be the same. It is not at all a problem for me but I find it bothersome, and possibly error-prone, for someone (most probably my students) not careful (aware?) of this issue. As for converting to base data.frame, wouldn't it be defeating the whole purpose of using data.table in the first place? – Vijay Dec 23 '14 at 02:00
  • Yes, it would defeat the purpose, but so would joining on non-keyed (indexed) columns. So you're best off just making sure your tables are properly keyed. – MrFlick Dec 23 '14 at 02:06
1

With reference to the Rdatatable github page, if you want to perform functions on your join rather than just merge tables, you can also do d1[d2, somefunc, on = "A==W"], where A is your column in d1 and W is your column in d2.

victor_v
  • 95
  • 8