
I am trying to merge two datasets in R based on two criteria: rows should match when they have the same id and year. One of the datasets has about 10000 rows and the other about 2000. I think that if I do a two-level, one-by-one search, the computing time will explode. The data is sorted by id and year. Is there a more efficient search algorithm than the naive comparison?

Yan Song
  • Yes. It's sorted by id and year. I have edited the question. – Yan Song Apr 19 '14 at 23:59
  • Note that 10k and 2k are relatively small numbers for a computer. So, here `explode` means about...1ms or something (extremely relative of course, but you get my point)? Since it's sorted you can skip ahead when things don't match. – keyser Apr 20 '14 at 00:01
  • Have you tried `merge`? – flodel Apr 20 '14 at 00:04
  • There are 4 vectors: two year and two id vectors. So there would be 2000 × 10000 × 2 comparisons. I am not sure if I am thinking about this problem correctly. – Yan Song Apr 20 '14 at 00:20
  • I think there is a problem with your reasoning here. You should be able to solve this in linear time by using hash tables or linear-time merge – Niklas B. Apr 20 '14 at 00:32
  • Well, post a sample from each dataset so we can see better. – Fernando Apr 20 '14 at 00:45
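
A minimal base-R sketch of the hash-table / linear-time idea from the comments above, assuming two data frames df1 and df2 with columns id and year (placeholder names, since no sample data was posted) and that (id, year) pairs are unique in df1:

# match() hashes the combined (id, year) keys, so finding the matching rows
# takes roughly linear time instead of 2000 x 10000 pairwise comparisons.
key1 <- paste(df1$id, df1$year)
key2 <- paste(df2$id, df2$year)
idx  <- match(key2, key1)                        # position of each df2 row in df1; NA if no match
merged <- cbind(df2[!is.na(idx), ], df1[idx[!is.na(idx)], ])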

1 Answer


There are many solutions to this problem, e.g. by merge, by indexing, or by looping (as you said).

However, the most elegant solution is to use the data.table package, which is really fast for managing data sets and can be considered an evolved version of data.frame.

Let us first set up the data. Based on the limited information provided in the question, here is a dummy attempt at the problem.

install.packages("data.table")   # only needed once

library(data.table)

set.seed(100)                    # make the dummy data reproducible

# Dummy stand-in for the larger data set: 10,000 rows with an id, a Year and one value column
dt1 <- data.table(
  id   = 1:10000,
  Year = sample(1950:2014, size = 10000, replace = TRUE),
  v1   = runif(10000)
)
head(dt1)

# Dummy stand-in for the smaller data set: 2,000 rows with two value columns
dt2 <- data.table(
  id   = sample(1:10000, 2000),
  Year = sample(1950:2014, size = 2000, replace = TRUE),
  v2   = runif(2000),
  v3   = runif(2000)
)
head(dt2)

Once the data is set up, the remaining part is very simple.

Step 1: Set the keys.

setkey(dt1, id, Year)  # set keys on the first table
setkey(dt2, id, Year)  # set keys on the second table
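
As an aside, a quick sketch of what the keys buy you, using arbitrary values from the dummy data above: setting the keys sorts each table by (id, Year), so lookups on key values use binary search rather than a full vector scan.

key(dt1)                              # "id" "Year" -- the columns the table is keyed on
dt1[J(42, 1990), nomatch = 0]         # keyed lookup: binary search on (id, Year)
dt1[id == 42 & Year == 1990]          # the same rows found by scanning every row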

Step 2: Merge whichever way you want.

dt1[dt2, nomatch = 0]  # for each row of dt2, find the matching rows in dt1 (inner join)
dt2[dt1, nomatch = 0]  # the same join, looking dt1 rows up in dt2
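
Note that nomatch = 0 drops rows without a counterpart (an inner join). As a sketch of the variants, keeping unmatched rows or using the familiar merge interface also works on the keyed tables:

dt1[dt2, nomatch = NA]                   # keep every row of dt2, with NA where dt1 has no match
merge(dt1, dt2, by = c("id", "Year"))    # data.table's merge method; same inner join as above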

The time taken to merge the data here is about 0.02 seconds. This stays extremely fast for very large data sets as well.

system.time(dt1[dt2,nomatch=0])    # 0.02 sec
system.time(dt2[dt1,nomatch=0])    # 0.02 sec
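
For a rough sense of scale (timings will vary by machine), the same join through base R's merge on plain data frames can be compared against the keyed join, using the same dummy data converted with as.data.frame:

df1 <- as.data.frame(dt1)   # plain data.frame copies of the dummy data
df2 <- as.data.frame(dt2)
system.time(merge(df1, df2, by = c("id", "Year")))   # compare with the keyed-join timings above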

To learn more about data.table:

?data.table           # help page for the package
example(data.table)   # run the examples from that help page

Hope this helps!!

If not, please post more details!

Shambho
  • Thanks for mentioning the data.table package. I find this intro to data.table helpful. Apparently, data.table uses binary search instead of a vector scan, so it's much faster. [link](http://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.pdf) – Yan Song Apr 20 '14 at 19:52
  • Exactly. There are two other nice reads as well. Try: `vignette("datatable-faq")` and `vignette("datatable-timings")` – Shambho Apr 20 '14 at 23:25
  • @YanSong, [this answer](http://stackoverflow.com/a/20057411/559784) may help answer some questions about `data.table` and binary search. – Arun Apr 26 '14 at 00:44
  • @Shambho, +1. Like the definition "evolved version of data.frame" ;) – Arun Apr 26 '14 at 00:45
  • @Shambho, yes, thanks to Matt :). I just tagged along recently. – Arun Apr 26 '14 at 08:04