1

I have a large data set (as a data.table,more than 100 million rows) and wish to calculate the difftime between two column.and the time string format if "%H%M%OS",without leading zeros.eg,data

ID  time0      time1
 1  93005220  93005320
 2  93551060  93553940
 3  93717740  93717740
 4  94840800  94840800
 5  94959510  94959510
 6 101030460 101127870
 7 101425010 101425010
 8 104728320 104728980
 9 105704050 105705890
10 135109830 135402830
11 93005220 150001330

Take the first row as example,I can not use difftime,strptime to calculate the difftime between "93005320"("09:30:05.320") and "93005220"("09:30:05.220"),I have to change "93005220" to the format like "093005.220",then I tried as:

difftime(strptime("093005.220","%H%M%OS"),strptime("093005.320","%H%M%OS"))

But if do like this, is too slow for more than 100 million rows. Is there an efficient way to do this in R?

A.hu
  • 27
  • 4
  • Is there a reason why you dont just subtract the times? the time difference seems to be in the millisecond range so your result would be a number anyway, not a time stamp. – D.J Jan 13 '22 at 07:04
  • @D.J,thanks for you comment.in real data,the time difference may be more than 3 hours,but all time are in the same day. eg. time0="93005220"("09:30:05.220"),time1="150001330"("15:00:01.330").and I add this as the last row to the eg.data. – A.hu Jan 13 '22 at 07:23
  • @D.J.And i do need the result as numeric,i will try to test substract,thanks. – A.hu Jan 13 '22 at 07:31
  • lubridate generally is faster, if you know it is hms format, try `hms(col1) - hms(col2)`. It is likely their implementation will speed it up a bit. – Oliver Jan 13 '22 at 07:31
  • 1
    @Oliver `hms` can not parse the format "93005220". – A.hu Jan 13 '22 at 07:39
  • what unit do you want the outcome to be in? seconds? – Donald Seinen Jan 13 '22 at 07:44
  • @DonaldSeinen yes,seconds. thanks. – A.hu Jan 13 '22 at 07:57

1 Answers1

0

One approach is to consider that each digit contains information about how many seconds have passed, but some are expressed differently (i.e hours, minutes). If all units of the digit were seconds, we could compare them directly using -.

f = function(x){
  c(36000, 3600, 600, 60, 10, 1, .1, .01, .001) *
    x %% c(1e9, 1e8, 1e7, 1e6, 1e5, 1e4, 1e3, 1e2, 1e1) %/%
    c(1e8, 1e7, 1e6, 1e5, 1e4, 1e3, 1e2, 1e1, 1e0)
}
ff = Vectorize(function(x, y) sum(f(x) - f(y)))

To then get the difference in seconds, here using dplyr:

df %>%
    mutate(diff = ff(time1, time0))
       time0     time1     diff
       <dbl>     <dbl>    <dbl>
 1  93005220  93005320     0.1 
 2  93551060  93553940     2.88
 3  93717740  93717740     0   
 4  94840800  94840800     0   
 5  94959510  94959510     0   
 6 101030460 101127870    57.4 
 7 101425010 101425010     0   
 8 104728320 104728980     0.66
 9 105704050 105705890     1.84
10 135109830 135402830   173   
11  93005220 150001330 19796.

Explanation: Using Anirban's approach, f transforms an input integer into a vector of its components (padding with 0 where necessary). These components are then converted to seconds by multiplication, a vectorized operation. That is, c(1,2) * c(3,4) evaluates to c(3, 8).

Donald Seinen
  • 4,179
  • 5
  • 15
  • 40
  • I like this avoid conversion, as `strptime` and string conversion for it takes much time. this is probably what i want, if it can work in parral. – A.hu Jan 13 '22 at 09:19
  • See also similar calculations here: [Convert proprietary time duration format to milliseconds](https://stackoverflow.com/questions/60517547/convert-proprietary-time-duration-format-to-milliseconds) – Henrik Jan 13 '22 at 10:55