
I have a large MySQL table (92 cols, 3 million rows) that I'm wrangling in R with dbplyr. I'm new to the package and have a question or two about dbplyr's lazy evaluation, as I'm experiencing some considerable slowness.

Suppose I have connected to my database and wish to perform a select operation to reduce the number of columns:

results <- my_conn_table %>%
            select(id, type, date)

Despite there being millions of rows, viewing results is near instantaneous (only showing 3 rows for clarity):

> results

# Source:   lazy query [?? x 3]
# Database: mysql []
      id type     date   
   <int> <chr>    <date>    
 1     1 Customer 1997-01-04 
 2     2 Customer 1997-02-08 
 3     3 Business 1997-03-29 
 ...
 

However, if I attempt to perform a filter operation with:

results <- my_conn_table %>%
            select(id, type, date) %>%
            filter(type == "Business")

the operation takes a very long time to process (tens of minutes in some instances). Is this long processing time simply a function of nrow ~= 3 million? (In other words, is there nothing I can do about it because the dataset is large?) Or can someone suggest some general ways to make this more performant?

My initial understanding of lazy evaluation was that the filter() query would only return the top few rows, preventing this long-run-time scenario. If I wanted all the data, I could run collect() to gather the results into my local R session (which I would expect to take a considerable amount of time, depending on the query).
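
To illustrate my expectation, here is a sketch (using the same my_conn_table connection as above; whether dbplyr actually behaves this way is part of my question):

preview <- my_conn_table %>%
  select(id, type, date) %>%
  filter(type == "Business") %>%
  head(10)      # I'd expect this to translate to LIMIT 10 on the database side

full_data <- my_conn_table %>%
  select(id, type, date) %>%
  filter(type == "Business") %>%
  collect()     # pulling everything into R is the step I'd expect to be slow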

Simon
  • I am not familiar with dbplyr, but with such dimensionality, you may consider learning data manipulation with the `data.table` package – Ubiminor Nov 10 '21 at 16:38
  • Once the data is within my R session `data.table` would indeed be a good approach to take. However, `dbplyr`'s main selling point is its lazy evaluation so until I run `collect()` I don't actually have the data in memory to process with `data.table` – Simon Nov 10 '21 at 16:40
  • Then I don't know. Looking at the manual (https://smithjd.github.io/sql-pet/chapter-lazy-evaluation-queries.html), I don't see syntax mistakes. Maybe it is just the size of the data that takes a long time to fetch? – Ubiminor Nov 10 '21 at 17:08

2 Answers


Building on @NovaEthos's answer, you can call show_query(results) to get the SQL query that dbplyr generated and is passing to the database. Posting this query here will make it clear whether there is any inefficiency in how the database is being queried.
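
For example (a minimal sketch; the exact SQL printed will depend on your table name and dbplyr version, and my_table below is just a placeholder):

results %>% show_query()

# Expected to print something along the lines of:
# <SQL>
# SELECT `id`, `type`, `date`
# FROM `my_table`
# WHERE (`type` = 'Business')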

A further thing to investigate is how your data is indexed. Like an index in a book, an index on a database table helps find records faster.

You might only be asking for 1000 records with type = 'business' but if these records only start from row 2 million, then the computer has to scan two-thirds of your data before it finds the first record of interest.

You can add indexes to existing database tables using something like the following:

query <- paste0("CREATE NONCLUSTERED INDEX my_index_name ON ", database_name, ".", schema_name, ".", table_name, " (", column_name, ");")

DBI::dbExecute(db_connection, as.character(query))

Note that this syntax is for SQL Server; your database may require slightly different syntax. In practice I wrap this in a function with additional checks, such as 'does the indexing column exist in the table?'. This is demonstrated here.
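
As a rough illustration only (the add_index() helper, the MySQL-flavoured CREATE INDEX statement, and the DBI::dbListFields() check below are my own sketch, not a tested utility):

add_index <- function(db_connection, table_name, column_name, index_name) {
  # check the indexing column actually exists in the table before creating the index
  if (!column_name %in% DBI::dbListFields(db_connection, table_name)) {
    stop("Column '", column_name, "' not found in table '", table_name, "'")
  }
  # MySQL-style syntax; adjust for your database
  query <- paste0("CREATE INDEX ", index_name, " ON ", table_name, " (", column_name, ")")
  DBI::dbExecute(db_connection, query)
}

# e.g. add_index(con, "my_table", "type", "idx_type")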

Simon.S.A.
  • Thanks for the reply @Simon.S.A. I've spent some time investigating the query that's sent to the server after your suggestion. I felt that opening a new question with these details would be clearer. If you care to take a look, I've opened the question here: https://stackoverflow.com/questions/69926452/dbplyr-generating-unexpected-sql-query – Simon Nov 11 '21 at 10:08
  • Following some research answering your other question, it appears the underlying cause is how MySQL executes nested queries. The advice above is still good for general dbplyr use, but not a complete answer when you also consider the flavor of SQL used. – Simon.S.A. Nov 11 '21 at 20:40
  • Thanks Simon. Accepting your answer as it ultimately put me on the right path to a full solution. I will continue with some benchmarking tests to find the most efficient way of performing these queries. Thanks again for your help. – Simon Nov 12 '21 at 08:22

One thing you could try is to run the same query but using the DBI package:

res <- DBI::dbSendQuery(con, "SELECT id, type, date FROM your_table_name WHERE type = 'Business'")

If it's the same speed, it's your data. If not, it's dbplyr. You could also try playing around with different filter parameters (maybe specify a single ID or date and see how long that takes) and see if the problem persists.
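
A rough way to time the comparison (a sketch only; con, your_table_name, and my_conn_table are placeholders from above, and dbFetch()/collect() are what actually pull the rows back to R):

# DBI version: send the query, then fetch the rows
system.time({
  res <- DBI::dbSendQuery(con, "SELECT id, type, date FROM your_table_name WHERE type = 'Business'")
  dbi_data <- DBI::dbFetch(res)
  DBI::dbClearResult(res)
})

# dbplyr version: the same logical query, collected into R
system.time({
  dbplyr_data <- my_conn_table %>%
    select(id, type, date) %>%
    filter(type == "Business") %>%
    collect()
})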

NovaEthos