I have a large MySQL table (92 columns, ~3 million rows) that I'm wrangling in R with dbplyr. I'm new to the package and have a question or two about dbplyr's lazy evaluation, as I'm experiencing some considerable slowness.
Suppose I have connected to my database and wish to perform a select() operation to reduce the number of columns:
results <- my_conn_table %>%
  select(id, type, date)
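For reference, my_conn_table is just a lazy reference to the remote table, created roughly along these lines (the driver and connection details below are placeholders, not my actual setup):

library(DBI)
library(dplyr)
library(dbplyr)

# Placeholder connection details -- my real credentials differ
con <- dbConnect(
  RMariaDB::MariaDB(),
  dbname   = "my_db",
  host     = "localhost",
  user     = "me",
  password = "****"
)

# Lazy reference to the remote table; no rows are pulled at this point
my_conn_table <- tbl(con, "my_table")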
Despite there being millions of rows, viewing results is near instantaneous (only showing 3 rows for clarity):
> results
# Source:   lazy query [?? x 3]
# Database: mysql []
     id type     date
  <int> <chr>    <date>
1     1 Customer 1997-01-04
2     2 Customer 1997-02-08
3     3 Business 1997-03-29
...
However, if I attempt to perform a filter operation with:
results <- my_conn_table %>%
  select(id, type, date) %>%
  filter(type == "Business")
the operation takes a very long time to process (tens of minutes in some instances). Is this long processing time simply a function of nrow ~= 3 million (in other words, there's nothing I can do about it because the dataset is large), or can anyone suggest some general ways to make this more performant?
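For what it's worth, the SQL that dbplyr generates looks reasonable to me; show_query() reports something along these lines (exact formatting will depend on the dbplyr version):

# Inspect the translated SQL without executing anything
results %>% show_query()
# <SQL>
# SELECT `id`, `type`, `date`
# FROM `my_table`
# WHERE (`type` = 'Business')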
My initial understanding of lazy evaluation was that the filter() query would only return the top few rows, precisely to avoid this long-running scenario, and that if I wanted all the data I could run collect() to pull the results into my local R session (which I would expect to take a considerable amount of time, depending on the query).
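In other words, I assumed a pattern like the following would stay cheap right up until collect(), since head() should be translated to a LIMIT on the database side (a sketch of what I had in mind):

preview <- my_conn_table %>%
  select(id, type, date) %>%
  filter(type == "Business") %>%
  head(10) %>%    # should become LIMIT 10 in the generated SQL
  collect()       # only now is the query run and the data pulled into R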