0

I am using this code to get difference in hours from two POSIXct dates.

x <- transform(x, HRS = ceiling(as.numeric(SHIP_DATE-PICK_DATE)))

This gives accurate results. However, when I tried to find the hour differences for another similar column, I needed to do this:

x <- transform(x, HRS_ADJ = ceiling(as.numeric(SHIP_DATE-ADJ_PICK_DATE)/60))

PICK_DATE & SHIP_DATE are extracted using the same formula.

x$SHIP_DATE <- ifelse(is.na(as.POSIXct(x$SHIP_DATE, format="%d-%b-%Y %H:%M %p")),
                      yes = as.POSIXct(x$SHIP_DATE, format="%d-%b-%Y %H:%M"),
                      no = as.POSIXct(x$SHIP_DATE, format="%d-%b-%Y %H:%M %p"))
x$SHIP_DATE <- as.POSIXct(x$SHIP_DATE, origin = "1970-01-01")

ADJ_PICK_DATE is computed as below:

x$ADJ_PICK_DATE <- ifelse(x$PICK_TIME=="EARLY",
                          as.POSIXct(paste(format(x$PICK_DATE, "%d-%b-%Y"), "03:00"),
                                     format="%d-%b-%Y %H:%M"), x$PICK_DATE)
x$ADJ_PICK_DATE <- ifelse(x$PICK_TIME=="LATE",
                          as.POSIXct(paste(format(x$PICK_DATE+86400, "%d-%b-%Y"),
                                           "03:00"), format="%d-%b-%Y %H:%M"),
                          x$ADJ_PICK_DATE)
x$ADJ_PICK_DATE <- as.POSIXct(x$ADJ_PICK_DATE, origin = "1970-01-01")

PICK_TIME is computed to adjust the PICK_DATE, as for any orders between 16:00 & 03:00, the lead time is to be calculated from 3AM.

Questions:

  1. How to efficiently generate the ADJ_PICK_DATE column (now it is too slow)?
  2. How to extract the source data into POSIXct using shorter and more efficient code? (It takes about 10-15 seconds per million data point on my i7 7th Gen CPU)
  3. Why did I need to use different formula for each pair of dates to calculate the no of days?

Sample data (The dates is formatted randomly in the source (PICK_DATE & SHIP_DATE) as both "DD-MMM-YYYY HH:mm" and "DD-MMM-YYYY hh:mm AM/PM"):

PICK_DATE    SHIP_DATE    PICK_TIME  
01-APR-2017 00:51    02-APR-2017 06:55    EARLY  
01-APR-2017 00:51    02-APR-2017 12:11 PM    EARLY  
01-APR-2017 07:51    02-APR-2017 12:11 PM    OKAY  
01-APR-2017 02:51 PM    02-APR-2017 09:39 AM    LATE  
Arani
  • 753
  • 1
  • 9
  • 23
  • Rather than share lots of code that runs on data we don't have, instead share (using `dput()`) just 2 or 3 rows of data that illustrate the problem. I think your question could be reduced to a single short paragraph and maybe 10 lines of code - shorter questions are much more likely to get good quick help. [See more tips here on making good reproducible examples in R](https://stackoverflow.com/q/5963269/903061). – Gregor Thomas Jul 28 '17 at 06:44
  • Thank you, I edited to shorten the query and eliminate redundant code samples. I think I need to include the extraction code because that can be the reason of this weird "bug". If I don't know the source of it, the data cannot be scaled and reused. – Arani Jul 28 '17 at 10:29

1 Answers1

0

OK, I got some of the solutions now.

  1. Using lubridate package, this method takes about 50% processing time:
x$ADJ_PICK_DATE <- ifelse(x$PICK_TIME=="EARLY",
                                  dmy_hm(paste(format(x$PICK_DATE, "%d-%b-%Y"), "03:00")),
                                  ifelse(x$PICK_TIME=="LATE",
                                         dmy_hm(paste(format(x$PICK_DATE+86400, "%d-%b-%Y"),
                                                      "03:00")), x$PICK_DATE))
        x$ADJ_PICK_DATE <- as.POSIXct(x$ADJ_PICK_DATE, origin = "1970-01-01")
  1. Again, using lubridate:
x$SHIP_DATE <- lubridate::dmy_hm(x$SHIP_DATE)
x$PICK_DATE <- lubridate::dmy_hm(x$PICK_DATE)
  1. Probably some formatting error while doing the conversion. I still need help on this problem.
Arani
  • 753
  • 1
  • 9
  • 23