
I am working on a dataset where I need to round a POSIXct timestamp column to the nearest hour in a data.table with 88 million rows.

I tried round.Date from base R and round_date from lubridate; both exceeded the available memory and could not finish. In the end I split the dataset into 4 equal parts, rounded the column in each part, and bound the parts back together (roughly as sketched below).
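
A sketch of that workaround (the table name DT, the column name timestamp, and the chunk count of 4 are illustrative):

library(data.table)
library(lubridate)

#split the row indices into 4 roughly equal parts
idx <- split(seq_len(nrow(DT)), cut(seq_len(nrow(DT)), 4, labels = FALSE))

#round each part separately, then bind the parts back together
DT <- rbindlist(lapply(idx, function(i) {
  part <- DT[i]
  part[, timestamp := round_date(timestamp, unit = "hour")]
  part
}))

This kept peak memory lower, since only one part of the column is copied and rounded at a time.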

Is there a more memory-efficient way to round dates?

Thanks in advance

1 Answer


This should work nicely and fast:

#sample data
library( data.table )
n = 1000000
set.seed(123)
DT <- data.table( id = 1:n,
                  timestamp = sample( seq( as.POSIXct('2017/01/01'), as.POSIXct('2020/05/01'), by = "5 mins" ),
                                      n, replace = TRUE ) )

#split timestamp to iDate and iTime
DT[, c("date", "time") := IDateTime( timestamp ) ]
#round the iTime
DT[, time_rounded := round( time, units = "hour" )]
#convert iDate and rounded iTime back to posixct (add timezone if needed)
DT[, timestamp_rounded := as.POSIXct( time_rounded, date = date ) ]

Possible issue: times that round up to 00:00 of the next day. You should test this and adjust the date if needed (see the quick check below).
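
One way to test that edge case is something like this (a sketch, not part of the original answer; it just inspects rows close to midnight):

#look at timestamps in the last half hour of the day, where rounding up
#can reach 24:00 / the next day's 00:00
DT[ time >= as.ITime("23:30:00"),
    .(timestamp, time, time_rounded, timestamp_rounded) ][ order(timestamp) ][ 1:10 ]

#if any timestamp_rounded values land on the wrong day,
#shift the affected rows by one day, e.g. timestamp_rounded +/- 86400 seconds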

Wimpel
  • The last line takes a long time. I have decided not to join date and time back, so this helped a lot, thank you! – Azat Oct 07 '20 at 17:45
  • You're welcome. You could also take a look at the `fasttime` package for handling large timestamp columns. – Wimpel Oct 07 '20 at 18:44
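
As an aside, a minimal example of fasttime (assuming the raw timestamps arrive as ISO-formatted character strings; fastPOSIXct parses them as UTC):

library(fasttime)
x <- c("2017-01-01 00:02:30", "2020-05-01 23:59:59")
fastPOSIXct(x, tz = "UTC")  #much faster than as.POSIXct() for large character vectors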