
I have two dataframes, each with a column "TIMESTAMP" that is in character format.

I would like to combine the dataframes, because they have exactly the same structure.
One dataframe has data from 2000 to 2003 and the other from 2002 to 2010, so they share data for 2002 and 2003. When I combine them, I would like to avoid duplicates for 2002 and 2003.

What possibilities do I have? I tried `newdata <- rbind(data1, data2)`, but that gives me the duplicates.
Is there a way to append only the rows whose "TIMESTAMP" isn't already present, so that no duplicates remain?

Before:

TIMESTAMP                PRICE   AMOUNT
2000-01-15 10:15:00      15      120 
2002-05-05 17:15:00      20      100
2003-03-14 12:30:00      22      170

TIMESTAMP                PRICE   AMOUNT
2003-03-14 12:30:00      22      170
2007-12-10 11:55:00      30      200
2010-10-05 12:10:00      44      240

As you can see, the row for 2003-03-14 12:30:00 is the same in both dataframes.
I'd like an output like this:

TIMESTAMP                PRICE   AMOUNT
2000-01-15 10:15:00      15      120 
2002-05-05 17:15:00      20      100
2003-03-14 12:30:00      22      170
2007-12-10 11:55:00      30      200
2010-10-05 12:10:00      44      240

I want to do an event study with these data, so I think a datetime format is the standard for this, or am I wrong?

Dave2e
Hawky
  • The title suggests the use of `as.POSIXct` to convert a string to a `POSIXt`-class object. However, the body of your question leads one to believe that you want to combine (row-concatenate) them without duplicates. This would be cleared up with sample data by pasting the output from `dput(x)` for both frames, where `x` is a representative sample. If you know you have duplicates in your real data, you should be sure that the samples you provide include at least one duplicate as well. After that, please show what you expect the output to be. Thanks! – r2evans Aug 27 '21 at 15:39
  • Thanks for your answer, I added the input and output. Hope this makes it a bit clearer. – Hawky Aug 27 '21 at 16:05
  • That's a good improvement, thank you. FYI, if you look at the bottom of my answer, I used `dput(dat1)` to produce the `structure(...)`. That format is very much preferred in two situations: (1) there may be a class we don't know about; if your `TIMESTAMP` is a string or a `POSIXt` class, we cannot tell the difference by looking just at the console output; and (2) when there are embedded spaces in any of the fields, as there are in your `TIMESTAMP`. It takes a little more effort to just "use" the data on my console. Ergo `dput(x)`. – r2evans Aug 27 '21 at 16:18

1 Answer


It depends.

If you want to remove duplicates based solely on TIMESTAMP, then

## base R
rbind(dat1, subset(dat2, !TIMESTAMP %in% dat1$TIMESTAMP))
#              TIMESTAMP PRICE AMOUNT
# 1  2000-01-15 10:15:00    15    120
# 2  2002-05-05 17:15:00    20    100
# 3  2003-03-14 12:30:00    22    170
# 21 2007-12-10 11:55:00    30    200
# 31 2010-10-05 12:10:00    44    240

## dplyr
library(dplyr)
filter(dat2, !TIMESTAMP %in% dat1$TIMESTAMP) %>%
  bind_rows(dat1, .)
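If you prefer join semantics, the same TIMESTAMP-based de-duplication can be written with dplyr's `anti_join`; a sketch, re-using the frames from the Data section at the bottom of this answer:

```r
## dplyr, same result via a keyed anti-join
library(dplyr)

dat1 <- data.frame(
  TIMESTAMP = c("2000-01-15 10:15:00", "2002-05-05 17:15:00", "2003-03-14 12:30:00"),
  PRICE = c(15L, 20L, 22L), AMOUNT = c(120L, 100L, 170L))
dat2 <- data.frame(
  TIMESTAMP = c("2003-03-14 12:30:00", "2007-12-10 11:55:00", "2010-10-05 12:10:00"),
  PRICE = c(22L, 30L, 44L), AMOUNT = c(170L, 200L, 240L))

## anti_join keeps rows of dat2 with no TIMESTAMP match in dat1
out <- bind_rows(dat1, anti_join(dat2, dat1, by = "TIMESTAMP"))
out
```

The `by = "TIMESTAMP"` argument makes the key explicit, which reads a bit more clearly than the `%in%` filter when more key columns get involved later.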

If you want de-duplication to be based on the whole row being the same (e.g., you may have two rows with the same TIMESTAMP but different values), then you can do

## base R
dat12 <- rbind(dat1, dat2)
dat12[!duplicated(dat12),]
#             TIMESTAMP PRICE AMOUNT
# 1 2000-01-15 10:15:00    15    120
# 2 2002-05-05 17:15:00    20    100
# 3 2003-03-14 12:30:00    22    170
# 5 2007-12-10 11:55:00    30    200
# 6 2010-10-05 12:10:00    44    240

## dplyr
bind_rows(dat1, dat2) %>%
  filter(!duplicated(.))
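The same whole-row de-duplication can also be expressed with dplyr's `distinct()`, which keeps the first occurrence of each unique row; a sketch with the same data:

```r
## dplyr, same idea via distinct()
library(dplyr)

dat1 <- data.frame(
  TIMESTAMP = c("2000-01-15 10:15:00", "2002-05-05 17:15:00", "2003-03-14 12:30:00"),
  PRICE = c(15L, 20L, 22L), AMOUNT = c(120L, 100L, 170L))
dat2 <- data.frame(
  TIMESTAMP = c("2003-03-14 12:30:00", "2007-12-10 11:55:00", "2010-10-05 12:10:00"),
  PRICE = c(22L, 30L, 44L), AMOUNT = c(170L, 200L, 240L))

## distinct() drops rows where every column matches an earlier row
out <- bind_rows(dat1, dat2) %>%
  distinct()
```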

Note: none of this treats TIMESTAMP as anything other than a string; for these operations, that may be fine. Ultimately, if you will be doing number-like things with the timestamps (e.g., computing time differences), then you will need to convert them to POSIXt with

## base R
dat12$TIMESTAMP <- as.POSIXct(dat12$TIMESTAMP)

## dplyr
dat12 <- mutate(dat12, TIMESTAMP = as.POSIXct(TIMESTAMP))
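A side note: being explicit about `format=` and `tz=` avoids locale-dependent parsing surprises. (`tz = "UTC"` here is an assumption; use whatever zone your data was recorded in.)

```r
## explicit format string and timezone for the conversion
dat12 <- data.frame(TIMESTAMP = c("2000-01-15 10:15:00", "2007-12-10 11:55:00"))
dat12$TIMESTAMP <- as.POSIXct(dat12$TIMESTAMP, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
class(dat12$TIMESTAMP)
# [1] "POSIXct" "POSIXt"
```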

Though this brings up a distinct issue: if any of the timestamps include sub-seconds, then finding duplicates may be problematic. Because POSIXct times are stored internally as numeric values (with a class attribute), they are subject to the known problem of floating-point equality. For examples of this, see Why are these numbers not equal?, Is floating point math broken?, and https://en.wikipedia.org/wiki/IEEE_754. As long as your data generally has a resolution no finer than seconds, you may never have a problem. The catch with floating-point equality is that it never produces a warning or an error, so even if you hit it, you may not notice. Your only indication might be the appearance of duplicates in your data.
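A minimal illustration of that floating-point equality problem (plain numerics here, but POSIXct values are the same kind of double underneath):

```r
## two values that "should" be equal are not, bit-for-bit
0.1 + 0.2 == 0.3
# [1] FALSE
print(0.1 + 0.2, digits = 17)
# [1] 0.30000000000000004
```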

Note 2: R defaults to "seconds" resolution when printing times on the console, but that doesn't mean that there are no milliseconds in the data.

now <- Sys.time()
now
# [1] "2021-08-27 12:16:58 EDT"
options(digits.secs = 3)
now
# [1] "2021-08-27 12:16:58.539 EDT"

I didn't change `now`; this is just an issue with how the data is printed on the console, not what it actually contains.

To handle de-duplication after converting to a POSIXt class when you have sub-second precision anywhere, there are more steps you will need to take to be confident it is doing what you need. (Probably too much for this answer.)
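One possible approach (a sketch, not the only way): round the underlying numeric value to a chosen tolerance before testing for duplicates. The 1 ms tolerance, the `key` helper, and the sample frame below are all assumptions for illustration.

```r
## timestamps that differ only below the tolerance count as duplicates
ts <- as.POSIXct("2003-03-14 12:30:00", tz = "UTC")
dat12 <- data.frame(TIMESTAMP = c(ts, ts + 1e-5, ts + 5),
                    PRICE = c(22L, 22L, 30L))

## seconds since epoch, rounded to milliseconds (tolerance is an assumption)
key <- round(as.numeric(dat12$TIMESTAMP), 3)
dat12[!duplicated(key), ]   # the row only 10 microseconds later is dropped
```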


Data

dat1 <- structure(list(TIMESTAMP = c("2000-01-15 10:15:00", "2002-05-05 17:15:00", "2003-03-14 12:30:00"), PRICE = c(15L, 20L, 22L), AMOUNT = c(120L, 100L, 170L)), row.names = c(NA, -3L), class = "data.frame")
dat2 <- structure(list(TIMESTAMP = c("2003-03-14 12:30:00", "2007-12-10 11:55:00", "2010-10-05 12:10:00"), PRICE = c(22L, 30L, 44L), AMOUNT = c(170L, 200L, 240L)), row.names = c(NA, -3L), class = "data.frame")
r2evans