2

Please help a beginner improve their understanding of the amazing data.table package on an basic problem.

I want to update a data table dt with new snippets of information with the same structure and column names dt2, ticker / date / price. Key is ticker and date.

  1. I know one could use rbind(dt, dt2) looking here or here, but I don't want to risk duplicates, so thought rather join them through a merge.

  2. I was expecting merge(dt, dt2, all = TRUE) to do the trick.

    But you get x.price y.price and not a single price column!?

    Why does dt not match the shared columns in 2? Data frame docs imply they use intersect of names but DT seems different.

  3. Currently, I am getting around it with by = colnames(dt) i.e. merge(dt, dt2, all = TRUE, by = colnames(dt)), but is this the right way to do it?

Thanks I would appreciate some guidance/confirmation (and without resorting to another package).

Example Code

library(data.table)
# Main data
dt <- data.table(ticker = c("A", "A"), 
                 date   = c("Jan", "Feb"),
                 price  = c(10, 11),
                 key    = c("ticker", "date"))

# Update data with same column names
dt2 <- data.table(ticker = c("A", "B"), 
                  date   = c("Mar", "Mar"),
                  price  = c(12, 100),
                  key    = c("ticker", "date"))

# Lets put them together
merge(dt, dt2, all = TRUE)
# but get 4 x 4 back (not 4 x 3) with price.x price.y ???

    ticker date price.x price.y
1:      A  Feb      11      NA
2:      A  Jan      10      NA
3:      A  Mar      NA      12
4:      B  Mar      NA     100

# This works, but am I missing easier syntax?
merge(dt, dt2, all = TRUE, by = colnames(dt2))

   ticker date price
1:      A  Feb    11
2:      A  Jan    10
3:      A  Mar    12
4:      B  Mar   100

(An aside) Is it better practice in R to be clear and specify merge.data.table?

Community
  • 1
  • 1
micstr
  • 5,080
  • 8
  • 48
  • 76
  • 2
    Have a look at `?merge.data.table`. The `by` defaults to the common key columns and this is why specifying `by=colnames(dt2)` gives a different result than omitting the `by` argument (which then defaults to the first two columns). – konvas Mar 30 '15 at 11:15
  • 2
    That's because you've set `key = c("ticker", "date")` in `dt`. Change it to `key = c("ticker", "date", "price")`. And you don't need to specify the key in `dt2` at all. Re your second question. `merge` is a generic function so it doesn't care if you write `merge` or `merge.data.table`. But if you want to escape the method dispatch overhead, `merge.data.table` could be a bit faster. This is all appears in documentation of `?merge` – David Arenburg Mar 30 '15 at 11:16
  • Thanks @konvas and David. I need the key later in the code. So that's why I am looking for a temporary override for unkeyed columns. Do you guys suggest I just go with this colnames thing, or do you unkey and reset keys often with DT? – micstr Mar 30 '15 at 11:26

0 Answers0