6

I would like to join two data.tables using the date as join.

Well , sometime i didn't have a exact match and in this case i would like to find the nearest less date. My probleme is very similar to this post about SQL : SQL Join on Nearest less than date

I know data.table syntax is analogous to SQL but I can't to code this. What is the correct syntax?

A simplified example :

Dt1 
   date      x
1/26/2010 - 10  
1/25/2010 - 9  
1/24/2010 - 9   
1/22/2010 - 7    
1/19/2010 - 11

Dt2
   date
1/26/2010   
1/23/2010   
1/20/2010  

output

   date     x
1/26/2010 - 10  
1/23/2010 - 7 
1/20/2010 - 11

Thank you in advance.

Community
  • 1
  • 1
mat
  • 107
  • 1
  • 6

2 Answers2

6

Here you go:

library(data.table)

Create the data:

Dt1 <- read.table(text="
date      x
1/26/2010,  10  
1/25/2010,  9  
1/24/2010,  9   
1/22/2010,  7    
1/19/2010,  11", header=TRUE, stringsAsFactors=FALSE)

Dt2 <- read.table(text="
date
1/26/2010   
1/23/2010   
1/20/2010", header=TRUE, stringsAsFactors=FALSE)

Convert to data.table, convert strings to dates, and set the data.table key:

Dt1 <- data.table(Dt1)
Dt2 <- data.table(Dt2)

Dt1[, date:=as.Date(date, format=("%m/%d/%Y"))]
Dt2[, date:=as.Date(date, format=("%m/%d/%Y"))]

setkey(Dt1, date)
setkey(Dt2, date)

Join the tables, using roll=TRUE:

Dt1[Dt2, roll=TRUE]

           date  x
[1,] 2010-01-20 11
[2,] 2010-01-23  7
[3,] 2010-01-26 10
Andrie
  • 176,377
  • 47
  • 447
  • 496
  • 3
    @mat I'm glad this answer helped you. For future reference: it helps a lot if you provide a reproducible code in your question - this encourages people to answer, since it means we don't have to recreate your problem from scratch, as I've done here. Also, if you posted your code with error message, we could have explained why that error message occurs and how you should go about fixing it. – Andrie Jul 05 '12 at 11:11
  • 1
    Thank Andrie and Matthew for your answers ! I had seen this parameter "roll" when i read the documentation . I tried it before but it doesn't work , I had this output Erreur dans `[.data.table`(pixel, Trep, roll = T) : Attempting roll join on factor column i.date. Only integer, double or character colums may be roll joined. I had a probleme of date format. Now i understand my probleme deal with the date format . thanks a lot on more time . – mat Jul 05 '12 at 11:14
2
?data.table                  # search for the `roll` argument
example(data.table)          # search for the example using roll=TRUE
vignette("datatable-intro")  # see section "3: Fast time series join" 
vignette("datatable-faq")    # see FAQs 2.16 and 2.20

This is one of the main features of data.table. Since rows are ordered (unlike SQL) this operation is simple and very fast. SQL is inherently unordered so you need a self join and 'order by' to do this task. It can be done in SQL and it works but it can be slow and needs more code. Since SQL is a row store, even in-memory SQL, it has a lower bound determined by page fetches from RAM into L2 cache. data.table is below that lower bound because it's a column store.

The 2 vignettes are also on the homepage.

Matt Dowle
  • 58,872
  • 22
  • 166
  • 224