
I use dplyr to get data from an SQL database and it usually works pretty well.

But sometimes my code runs very slowly and I'm not sure why. Most of the time I connect to tables with about 1 million rows, filter some data, and then call the collect() function, like this:

flights <- tbl(
  getDbConn("flight_data"),
  in_schema(
    "flights",
    "usa")
  ) %>% 
  filter(destination == "east_coast") %>% 
  filter(!city %in% c("NYC", "MIA")) %>% 
  filter(passanger_id %in% passangers$id) %>%
  select(city, passanger_id, date) %>%
  collect()

It takes only a few seconds to load the data from the database. But sometimes my code is very slow and I don't know why. It might be related to the kind of data stored in the database, such as durations. In that case it takes 10 minutes to load the data with the same code as above.

The only thing that always runs fast is

flights <- tbl(
  getDbConn("flight_data"),
  in_schema(
    "flights",
    "other_table")
  ) %>%
  select(city, passanger_id, date)

The code above runs in a few seconds. But once I use collect() with this data it is slow - 10 minutes or more. Is there a fast alternative to collect()? I tried tbl_df but it is similarly slow.

1 Answer


This is because of lazy evaluation. This part only builds the query plan; nothing is sent to the database yet:

flights <- tbl(
  getDbConn("flight_data"),
  in_schema(
    "flights",
    "usa")
  ) %>% 
  filter(destination == "east_coast",
         !city %in% c("NYC", "MIA"), 
         passanger_id %in% passangers$id) %>%
  select(city, passanger_id, date)

You can see the query you've actually created:

flights %>% 
  show_query()
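
For the pipeline above, show_query() prints something roughly like the following (the exact SQL depends on your backend, and the IN list is expanded from your local passangers$id vector, so a long id vector produces a very long query):

<SQL>
SELECT city, passanger_id, date
FROM flights.usa
WHERE (destination = 'east_coast')
  AND (NOT (city IN ('NYC', 'MIA')))
  AND (passanger_id IN (...))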

Only when you run collect() is the query executed and the data transferred:

flights %>% 
  collect()
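
If you want to see where the time goes, you can time the two steps separately. A minimal sketch, reusing the getDbConn() helper and the schema/table names from your question:

library(dplyr)
library(dbplyr)

# Building the lazy tbl only constructs a query plan, so this is nearly instant
flights_lazy <- tbl(getDbConn("flight_data"), in_schema("flights", "usa")) %>%
  filter(destination == "east_coast") %>%
  select(city, passanger_id, date)

# collect() is where the query actually runs on the server and the rows
# travel over the network, so all of the waiting happens here
system.time(flights <- collect(flights_lazy))

There is no faster drop-in replacement for collect(): tbl_df() and similar wrappers still have to execute the same query and transfer the same rows. The practical speed-up is to push more filtering (and fewer columns) into the lazy query so that less data crosses the wire.
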
  • Should we use dplyr or DBI? What is the advantage of dplyr if it takes forever? I do not see any. – kraggle Sep 13 '22 at 23:50