1

Working with , I have these variables with a strange format called dttm displayed as follows:

 tpep_pickup_datetime  tpep_dropoff_datetime 
               <dttm>                 <dttm>               
  2015-01-15 18:05:39    2015-01-15 18:23:42                    
  2015-01-10 19:33:38    2015-01-10 19:53:28              
  2015-01-10 19:33:38    2015-01-10 19:43:41              
  2015-01-10 19:33:39    2015-01-10 19:35:31               

I would like to calculate the time difference in terms of seconds between tpep_pickup_datetime and tpep_dropoff_datetime.

But using lubridate package it doesn't work. How can I transform these variables into a POSIXCT format using dplyr?

When I use the following code:

my_df %>% 
    mutate(diff_time = difftime(tpep_dropoff_datetime,tpep_pickup_datetime,units = "secs"))

I get this error:

 org.apache.spark.sql.catalyst.parser.ParseException: extraneous input
 'AS' expecting {')', ','}(line 1, pos 121)
M--
  • 25,431
  • 8
  • 61
  • 93
nenno lillo
  • 537
  • 1
  • 4
  • 12
  • Have you tried `difftime` with `units ="secs"` – akrun Oct 29 '19 at 18:07
  • 1
    when i try this `my_df %>% mutate(diff_time = difftime(tpep_dropoff_datetime,tpep_pickup_datetime,units = "secs"))` I get this error: `org.apache.spark.sql.catalyst.parser.ParseException: extraneous input 'AS' expecting {')', ','}(line 1, pos 121)` – nenno lillo Oct 29 '19 at 18:17
  • @joran I thank you for your comment, I would be even more grateful if you could clarify this aspect. I'm not a spark expert, so I don't understand when I should use the collect () command and what the difference is – nenno lillo Oct 29 '19 at 18:26
  • @joran unfortunately when I do `my_df%>% collect ()` I get an error. Probably because my data are many (over 5 GB). What do you advise me to do? – nenno lillo Oct 29 '19 at 18:44
  • The accepted answer broke my heart. You should not do every step like that. Please try my solution, let me know if you get an error, I will look into it to give you a somewhat reasonable answer. – M-- Oct 29 '19 at 23:46
  • Also please share your data using `dput()` – M-- Oct 30 '19 at 00:54

2 Answers2

0

Same as this question of yours we can use dplyr, lubridate::seconds_to_period and base::difftime to get the output.

library(dplyr)
library(lubridate)

df1 %>% 
  mutate(dur = seconds_to_period(as.numeric(difftime(tpep_dropoff_datetime, 
                                                     tpep_pickup_datetime), 
                                                     units = "secs")))

#> # A tibble: 4 x 3
#>   tpep_pickup_datetime tpep_dropoff_datetime dur     
#>   <dttm>               <dttm>                <Period>
#> 1 2015-01-15 18:05:39  2015-01-15 18:23:42   18M 3S  
#> 2 2015-01-10 19:33:38  2015-01-10 19:53:28   19M 50S 
#> 3 2015-01-10 19:33:38  2015-01-10 19:43:41   10M 3S  
#> 4 2015-01-10 19:33:39  2015-01-10 19:35:31   1M 52S

If you want the output in secs and not as.period, then the code below works.

library(dplyr)

df1 %>% 
  mutate(dur = as.numeric(difftime(tpep_dropoff_datetime, 
                                                     tpep_pickup_datetime), 
                                            units = "secs"))

#> # A tibble: 4 x 3
#>   tpep_pickup_datetime tpep_dropoff_datetime   dur
#>   <dttm>               <dttm>                <dbl>
#> 1 2015-01-15 18:05:39  2015-01-15 18:23:42    1083
#> 2 2015-01-10 19:33:38  2015-01-10 19:53:28    1190
#> 3 2015-01-10 19:33:38  2015-01-10 19:43:41     603
#> 4 2015-01-10 19:33:39  2015-01-10 19:35:31     112

Data:

read.table(text="tpep_pickup_datetime  tpep_dropoff_datetime 
  2015-01-15_18:05:39    2015-01-15_18:23:42                    
  2015-01-10_19:33:38    2015-01-10_19:53:28              
  2015-01-10_19:33:38    2015-01-10_19:43:41              
  2015-01-10_19:33:39    2015-01-10_19:35:31", stringsAsFactors=F, header=T) -> df1


df1 %>% 
  mutate_at(vars("tpep_pickup_datetime", "tpep_dropoff_datetime"), 
            list(~as.POSIXct(.,format="%Y-%m-%d_%H:%M:%S",tz=Sys.timezone()))) %>% 
  as_tibble -> df1

df1 #just to show the data

#>   tpep_pickup_datetime tpep_dropoff_datetime
#>   <dttm>               <dttm>               
#> 1 2015-01-15 18:05:39  2015-01-15 18:23:42  
#> 2 2015-01-10 19:33:38  2015-01-10 19:53:28  
#> 3 2015-01-10 19:33:38  2015-01-10 19:43:41  
#> 4 2015-01-10 19:33:39  2015-01-10 19:35:31  
Community
  • 1
  • 1
M--
  • 25,431
  • 8
  • 61
  • 93
0

try this:

  all_data <- all_data %>%
      mutate(new_pickup = as.POSIXct(tpep_pickup_datetime)) %>%
      mutate(day_pickup = as.Date(new_pickup)) %>%
      mutate(time_pickup = paste(hour(new_pickup), minute(new_pickup),second(new_pickup),sep="-")) %>%
      mutate(new_dropoff = as.POSIXct(tpep_dropoff_datetime)) %>%
      mutate(day_dropoff = as.Date(new_dropoff)) %>%
      mutate(time_dropoff = paste(hour(new_dropoff), minute(new_dropoff),second(new_dropoff),sep="-")) %>%
      mutate(trip_duration = ((hour(new_dropoff) - hour(new_pickup))*3600 + (minute(new_dropoff) - minute(new_pickup))*60 + (second(new_dropoff) - second(new_pickup))))
HABLOH
  • 460
  • 2
  • 12