
I'm curious what the best/fastest way is to combine multiple tables/data.frames when you do not want duplicate rows.

I have files with tens of millions of entries; however, there may be duplicates across files.

File A
IDNum         Name         Time
123           Light        12:30
234           Lind         12:30
123           Light        1:00

File B
IDNum         Name         Time
123           Light        1:00
123           Light        12:30
234           Lind         12:30

File C
IDNum        Name          Time
666          Ryuk          2:00
123          Light         NA

Desired Output
IDNum        Name          Time
123          Light         NA
123          Light         12:30
123          Light         1:00
234          Lind          12:30
666          Ryuk          2:00

I know there are a lot of ways to do this using various join/merge methods, but is there a specific approach that is fast? I have hundreds of files, each with millions of rows, and an unknown, likely large, number of duplicates.

  • [Required reading.](http://stackoverflow.com/a/24376207/4497050) For combining a list into one data.frame: `do.call(rbind, ...)`, `dplyr::bind_rows`, or `data.table::rbindlist` (sketched below). – alistaire Jul 05 '16 at 06:12
  • `rbindlist` is already shown as a solution below, so you can use that. – akrun Jul 05 '16 at 06:13
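
For reference, a minimal sketch of the three list-to-one-data.frame approaches named in the comment above. It assumes the tables already exist as objects `FileA`, `FileB`, `FileC` (hypothetical names; the question's files would need to be read in first):

 library(data.table)
 library(dplyr)

 dfs <- list(FileA, FileB, FileC)   # hypothetical list of the tables

 base_res  <- do.call(rbind, dfs)   # base R; copies everything, slowest on many tables
 dplyr_res <- bind_rows(dfs)        # dplyr; returns a data.frame/tibble
 dt_res    <- rbindlist(dfs)        # data.table; typically the fastest of the three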

1 Answer


We can place the datasets in a `list`, combine them with `rbindlist`, and keep only the distinct rows with `unique`.

 library(data.table)
 # assuming the tables are named FileA, FileB, FileC in the global environment
 unique(rbindlist(mget(ls(pattern = "^File[A-Z]$"))))[order(IDNum, Name)]
 #    IDNum  Name  Time
 #1:   123 Light 12:30
 #2:   123 Light  1:00
 #3:   123 Light    NA
 #4:   234  Lind 12:30
 #5:   666  Ryuk  2:00
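
Since the question mentions hundreds of files, here is a sketch for the on-disk case, assuming the files are CSVs in a single directory (the path and pattern are placeholders to adjust for your data):

 library(data.table)
 # hypothetical directory of CSV files
 files <- list.files("path/to/files", pattern = "\\.csv$", full.names = TRUE)
 # fread each file, stack them in one pass, then drop duplicate rows
 DT <- unique(rbindlist(lapply(files, fread)))
 setorder(DT, IDNum, Name)

Deduplicating once on the combined table is generally cheaper than repeatedly merging pairs of tables, since `unique` makes a single pass over the stacked data.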
akrun