1

I am trying to execute a simple query as given below on data set with 20 million rows with 10 columns , but it is taking very long time to compute the final output (30 minutes) . Is there any better way to achieve the purpose ?

(t<-Sys.time())

rd_1<-as.data.frame(rd_1 %>%
group_by(customer,location_name,Location_Date,Location_Hour) %>%
filter(created_time==max(created_time))%>%
ungroup())

(t<-Sys.time())

Below is the timestamps after running the script ..

[1] "2018-12-19 09:15:47 GMT"

> rd_1<-as.data.frame(rd_1 %>%
+ group_by(customer,location_name,Location_Date,Location_Hour) %>%
+ filter(created_time==max(created_time))%>%
+ ungroup())

> (t<-Sys.time())

[1] "2018-12-19 09:45:25 GMT"
Yogesh Kumar
  • 609
  • 6
  • 22
  • 3
    Try `data.table` option then https://stackoverflow.com/a/24558696/3962914 – Ronak Shah Dec 19 '18 at 10:10
  • The grouping might take more time than the filter, try `. <- group_by(rd_1, customer,location_name,Location_Date,Location_Hour)` and `. <- filter(.,created_time==max(created_time))` and see which step takes time. Also what is `select(rd_1, customer, location_name, Location_Date, Location_Hour) %>% n_distinct` ? – moodymudskipper Dec 19 '18 at 10:20
  • I need to keep all columns after applying filter ...so I wont use select ... etc.... rd_1 is my data frame not a column so it wont be part of group by condition – Yogesh Kumar Dec 19 '18 at 10:22
  • why have you wrapped it all in `as.data.frame`? that would make it spend a lot of time. do you have the same issue with: rd_1 %>% group_by(customer,location_name,Location_Date,Location_Hour) %>% filter(created_time==max(created_time))%>% ungroup() %>% as.data.frame() – Athanasia Mowinckel Dec 19 '18 at 10:34
  • It is still taking similar amount of time when i remove "as.data.frame" and use rd_1%>% ...... – Yogesh Kumar Dec 19 '18 at 11:34
  • I would compute `max(created_time)` in a separate temp table and then make an `inner_join`. Also, I would measure the running time for each step individually. – tomaz Dec 19 '18 at 13:59
  • How ? I am calculating max subject to certain conditions(group by) ....So how would u do that ? – Yogesh Kumar Dec 19 '18 at 14:33

1 Answers1

1

Try:

temp <- rd_1 %>% 
  group_by(customer,location_name,Location_Date,Location_Hour) %>%
  summarise(created_time = max(created_time)) %>%
  ungroup()

rd_1 <- rd_1 %>% 
   inner_join(temp) %>% 
   as.data.frame()

rm(temp)
tomaz
  • 493
  • 4
  • 13
  • Thanks a lot , it worked for me ...Took 2 minutes instead of 30 minutes that my script was taking to finish the task.. – Yogesh Kumar Jan 02 '19 at 17:56