How to find matches across many large data.frames/data.tables from one source file?

Question

I am a slightly novice R user, and I am wondering how to automate finding matching rows (e.g., inner join merge), from one "master" data.table across multiple, large (meaning ~8 GB files) data.tables, and to take those matches and place them in one final table.

The "master" data.table has two columns and looks like this:

pid    pkid
123    8975A
456    9683

While the multiple, larger data.tables look like this:

pid    pkid   pha
123    8975A  3
456    9683   2
789    1253   3

I have my code set up to read all of the special delimited files from one directory, set the key for the data.table merge, and to make names from the files:

temp = list.files(pattern="*.txt")
list2env(
  lapply(setNames(temp, make.names(gsub("*.txt$", "", temp))), 
         fread,sep="|",setkey(packageid,patientid)), envir = .GlobalEnv)

Where I'm stuck is how to automate the numerous (40 something?) merges from the one "master" data.table and to dump all of the matches from the many merges into a separate data.table. Basically, I want all the cases in the larger data.tables that have matching pid and pkid to the "master" to get the corresponding pha.

The only way that I've come up with is tragically manual, and will take a long time to write out, and is undoubtedly highly inefficient.

res1<-data.table::merge(master,dt1)
res2<-data.table::merge(master,dt2)
l=list(res1,res2)
final<-rbindlist(l)

Any thoughts? Thanks!

Something like this: `Reduce(function(x, y) merge(x, y), list(master, dt1, dt2..))`? — Sumedh, Aug 12 '16 at 23:22
I don't know what you're trying to do with list2env there, but I would rbindlist with an idcol instead, and then merge master on the result. — Frank, Aug 12 '16 at 23:56
Sumedh - this method won't take the "pha" column that I wanted. Frank - could you explain some more? — heesemonster, Aug 13 '16 at 00:08
@heesemonster, I don't see why not. You should probably include the `all` set of arguments in `merge`, but that still shouldn't be the reason why `pha` is not in the merged dataframe. FYI, here are the relevant links: [this](http://stackoverflow.com/questions/14096814/r-merging-a-lot-of-data-frames) and [this](http://stackoverflow.com/questions/8091303/simultaneously-merge-multiple-data-frames-in-a-list) — Sumedh, Aug 13 '16 at 02:25
After your `lapply`, you have a list of tables. Instead of using list2env, keep it as a list; call rbindlist on it so you have a single table, using the `idcol` argument to keep track of the source; then just merge master with this large table. — Frank, Aug 13 '16 at 04:35
Frank & Sumedh: I was able to get the "pha" column and it all works great now. Thanks so much!! — heesemonster, Aug 14 '16 at 16:14

How to find matches across many large data.frames/data.tables from one source file?

0 Answers0