
I'm curious what the best/fastest way is to combine multiple tables/data.frames when you do not want duplicate rows.

I have files with tens of millions of entries; however, there may be duplicates across files.

File A
IDNum         Name         Time
123           Light        12:30
234           Lind         12:30
123           Light        1:00

File B
IDNum         Name         Time
123           Light        1:00
123           Light        12:30
234           Lind         12:30

File C
IDNum        Name          Time
666          Ryuk          2:00
123          Light         NA

Desired Output
IDNum        Name          Time
123          Light         NA
123          Light         12:30
123          Light         1:00
234          Lind          12:30
666          Ryuk          2:00

I know there are a lot of ways to do this using various join/merge methods, but is there a specific approach that is fast? I have hundreds of files, each with millions of rows, and an unknown, likely large, number of duplicates.

  • [Required reading.](http://stackoverflow.com/a/24376207/4497050) For combining a list into one data.frame: `do.call(rbind, ...)`, `dplyr::bind_rows`, or `data.table::rbindlist` (sketched below). – alistaire Jul 05 '16 at 06:12
  • `rbindlist` is already shown as a solution below, so you can use that. – akrun Jul 05 '16 at 06:13
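
For reference, a minimal sketch of the three list-to-one-data.frame approaches named in the comment above. It assumes the tables already exist as objects `FileA`, `FileB`, `FileC` (hypothetical names; the question's files would need to be read in first):

 library(data.table)
 library(dplyr)

 dfs <- list(FileA, FileB, FileC)   # hypothetical list of the tables

 base_res  <- do.call(rbind, dfs)   # base R; copies everything, slowest on many tables
 dplyr_res <- bind_rows(dfs)        # dplyr; returns a data.frame/tibble
 dt_res    <- rbindlist(dfs)        # data.table; typically the fastest of the three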

1 Answer


We can place the datasets in a `list`, combine them with `rbindlist`, and keep only the distinct rows with `unique`.

 library(data.table)
 # assuming the tables are named FileA, FileB, FileC in the global environment
 unique(rbindlist(mget(ls(pattern = "^File[A-Z]$"))))[order(IDNum, Name)]
 #    IDNum  Name  Time
 #1:   123 Light 12:30
 #2:   123 Light  1:00
 #3:   123 Light    NA
 #4:   234  Lind 12:30
 #5:   666  Ryuk  2:00
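
Since the question mentions hundreds of files, here is a sketch for the on-disk case, assuming the files are CSVs in a single directory (the path and pattern are placeholders to adjust for your data):

 library(data.table)
 # hypothetical directory of CSV files
 files <- list.files("path/to/files", pattern = "\\.csv$", full.names = TRUE)
 # fread each file, stack them in one pass, then drop duplicate rows
 DT <- unique(rbindlist(lapply(files, fread)))
 setorder(DT, IDNum, Name)

Deduplicating once on the combined table is generally cheaper than repeatedly merging pairs of tables, since `unique` makes a single pass over the stacked data.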
akrun