It depends.
If you want to remove duplicates based solely on `TIMESTAMP`, then
## base R
rbind(dat1, subset(dat2, !TIMESTAMP %in% dat1$TIMESTAMP))
# TIMESTAMP PRICE AMOUNT
# 1 2000-01-15 10:15:00 15 120
# 2 2002-05-05 17:15:00 20 100
# 3 2003-03-14 12:30:00 22 170
# 21 2007-12-10 11:55:00 30 200
# 31 2010-10-05 12:10:00 44 240
## dplyr
library(dplyr)
filter(dat2, !TIMESTAMP %in% dat1$TIMESTAMP) %>%
bind_rows(dat1, .)
If you want it to be based on the whole row being the same (e.g., you may have two rows with the same `TIMESTAMP` but with different values in the other columns), then you can do
## base R
dat12 <- rbind(dat1, dat2)
dat12[!duplicated(dat12),]
# TIMESTAMP PRICE AMOUNT
# 1 2000-01-15 10:15:00 15 120
# 2 2002-05-05 17:15:00 20 100
# 3 2003-03-14 12:30:00 22 170
# 5 2007-12-10 11:55:00 30 200
# 6 2010-10-05 12:10:00 44 240
## dplyr
bind_rows(dat1, dat2) %>%
filter(!duplicated(.))
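As a side note, dplyr also provides `distinct()`, which does whole-row de-duplication directly and is arguably more idiomatic than `filter(!duplicated(.))`:

```r
library(dplyr)

bind_rows(dat1, dat2) %>%
  distinct()
# same five rows as the duplicated()-based version above
```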
Note: none of this treats `TIMESTAMP` as anything other than a string; for these operations, that may be fine. Ultimately, if you will be doing number-like things with the timestamps (e.g., time differences), then you will need to convert them to `POSIXt` with
## base R
dat12$TIMESTAMP <- as.POSIXct(dat12$TIMESTAMP)
## dplyr
dat12 <- mutate(dat12, TIMESTAMP = as.POSIXct(TIMESTAMP))
Though this brings up a distinct issue: if any of the timestamps include sub-seconds, then finding duplicates may be problematic. Because `POSIXt` times are stored internally as `numeric` (with classes), they are subject to the well-known problem of floating-point equality. For examples of this, see Why are these numbers not equal?, Is floating point math broken?, and https://en.wikipedia.org/wiki/IEEE_754. As long as your data generally has a resolution no finer than seconds, you may never have a problem. The catch with floating-point equality is that it never produces a warning or an error, so even if you have the problem, you may not see it; your only indication might be the appearance of duplicates in your data.
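To make that concrete, here is a small made-up example: two times one millisecond apart print identically at the default resolution, yet are not duplicates.

```r
t1 <- as.POSIXct("2021-08-27 12:16:58", tz = "UTC")
t2 <- t1 + 0.001   # one millisecond later
c(t1, t2)
# [1] "2021-08-27 12:16:58 UTC" "2021-08-27 12:16:58 UTC"
duplicated(c(t1, t2))
# [1] FALSE FALSE
```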
Note 2: R defaults to "seconds" resolution when printing times on the console, but that doesn't mean that there are no milliseconds in the data.
now <- Sys.time()
now
# [1] "2021-08-27 12:16:58 EDT"
options(digits.secs = 3)
now
# [1] "2021-08-27 12:16:58.539 EDT"
I didn't change `now`; this is just an issue with how the data is printed on the console, not what the data actually contains.
To solve the de-duplication after converting to `POSIXt` class: if you have sub-second precision anywhere, there are more steps you will need to take in order to be confident it is doing what you need. (Probably too much for this answer.)
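(If you need a nudge in that direction: one common workaround, sketched here under the assumption that one-second resolution is what you actually intend, is to de-duplicate on the times rounded to whole seconds rather than on the raw floating-point values.)

```r
## base R sketch: de-duplicate on whole seconds
## (assumes dat12$TIMESTAMP has already been converted to POSIXct)
secs <- round(as.numeric(dat12$TIMESTAMP))  # seconds since the epoch
dat12[!duplicated(secs), ]
```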
Data
dat1 <- structure(list(TIMESTAMP = c("2000-01-15 10:15:00", "2002-05-05 17:15:00", "2003-03-14 12:30:00"), PRICE = c(15L, 20L, 22L), AMOUNT = c(120L, 100L, 170L)), row.names = c(NA, -3L), class = "data.frame")
dat2 <- structure(list(TIMESTAMP = c("2003-03-14 12:30:00", "2007-12-10 11:55:00", "2010-10-05 12:10:00"), PRICE = c(22L, 30L, 44L), AMOUNT = c(170L, 200L, 240L)), row.names = c(NA, -3L), class = "data.frame")