14

Occasionally, when connecting to my Oracle database through ROracle and dbplyr, I will run a dplyr::collect() operation that fetches more data than expected, and more than R can handle.

This can make R crash, and is often a sign that I should have filtered or aggregated the data further before fetching it.

It would be great to be able to check the size of the result before choosing whether to fetch it (without running the query twice).

Let's call collect2 the variation of collect that would allow this.

expected behavior:

small_t <- con %>% tbl("small_table") %>%
  filter_group_etc %>%
  collect2(n_max = 5e6) # works fine

big_t   <- con %>% tbl("big_table")   %>%
  filter_group_etc %>%
  collect2(n_max = 5e6) # Error: query returned 15.486.245 rows, n_max set to 5.000.000

Would this be possible?

I'm also open to a solution using ROracle / DBI without dplyr, e.g.:

dbGetQuery2(con, my_big_sql_query, n_max = 5e6) # Error: query returned 15.486.245 rows, n_max set to 5.000.000
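
For illustration, here is a rough sketch of how such a dbGetQuery2 could be written with plain DBI (the function is hypothetical; the query runs only once, but up to n_max rows are still fetched before we know whether the limit is exceeded):

library(DBI)

dbGetQuery2 <- function(con, statement, n_max = 5e6) {
  res <- dbSendQuery(con, statement)   # execute the query once
  on.exit(dbClearResult(res))          # always release the result set
  out <- dbFetch(res, n = n_max)       # fetch at most n_max rows
  if (!dbHasCompleted(res)) {          # rows remain: the limit was exceeded
    stop("query returned more than ", n_max, " rows")
  }
  out
}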

EDIT:

See below a partial solution, posted as an answer; it is not optimal because some time is wasted fetching data I have no use for.

  • A test on `nrow` is not sufficient? – F. Privé Nov 06 '17 at 06:56
  • On the nrow of the query results, yes, but how do I fetch the data in R without running the query once for nrow and a second time for the data? – moodymudskipper Nov 06 '17 at 10:24
  • You can't do `df <- con %>% tbl("big_table") %>% filter_group_etc` and query only `nrow(df)`? (I don't really know databases) – F. Privé Nov 06 '17 at 13:35
  • I can, but it will execute the full query on the server side and return the number of rows; then, depending on what I get, I can execute it again to get the full results. Doing this, I run it twice on the server side. It does solve my issue of R crashing, but it's a big overhead. – moodymudskipper Nov 06 '17 at 13:49
  • Are you connecting to a database via dplyr? I assume 'yes', but it would be great to know the exact database type [Oracle, MySQL, Mongo]. Why? One option is to place limits on the server-side query time and size. – Technophobe01 Nov 06 '17 at 19:59
  • I assumed it would be clear from the tags, but I'll edit now to make it explicit – moodymudskipper Nov 06 '17 at 20:48
  • I would just write the SQL code to return the top X records and send the query string through from R. I usually use RODBC and Microsoft products, but you should be able to do the same thing with `dbSendQuery` in the `ROracle` package. – Ian Wesley Nov 07 '17 at 00:23
  • I think it would lead to a solution similar to the one I proposed below; some resources are spent fetching rows I don't need, and the actual number of rows is never known. – moodymudskipper Nov 07 '17 at 07:06

4 Answers

6

This doesn't get around the problem you mention in the comments about spending the resources to run the query twice, but it does seem to work (at least against my MySQL database; I don't have an Oracle database to test it against):

collect2 <- function(query, limit = 20000) {

  # force the row count to be computed on the database side
  query_nrows <- query %>% 
    ungroup() %>% 
    summarize(n = n()) %>% 
    collect() %>% 
    pull('n')

  if (query_nrows <= limit) {
    collect(query)
  } else {
    warning("Query has ", query_nrows, " rows; limit is ", limit,
            ". Data will not be collected.")
  }

}

I don't see any way to test the number of rows in the results of a query without actually running the query. With this method, though, you always force the computation of the row count to happen in the database first, and refuse to collect if you're over 20,000 (or whatever your row limit is).
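
For example, against the tables from the question (hypothetical usage; the row count in the warning is the one quoted in the question):

con %>% tbl("small_table") %>%
  filter_group_etc %>%
  collect2(limit = 5e6)    # under the limit: data is collected

con %>% tbl("big_table") %>%
  filter_group_etc %>%
  collect2(limit = 5e6)    # over the limit: only a warning is emitted
# Warning message:
# Query has 15486245 rows; limit is 5e+06. Data will not be collected.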

crazybilly
  • One way would be to create a temp table with `dbSendQuery`, with something like `con %>% tbl("MY_TBL") %>% dplyr_stuff %>% sql_render %>% {dbSendQuery(con, paste0("CREATE TABLE TEMP_TBL(", ., ")"))}` the first time, and then fetch the result if the number of rows is OK. But I'd need write permissions and would have to physically write the data on the server, while it would be more optimal to keep it in memory there while we decide whether to collect or not. – moodymudskipper Nov 07 '17 at 16:00
  • Still, I upvoted because it's nice to have this solution laid out, and maybe (I don't know much about it), if the `dbms` optimizes the first query, the second run for the "real deal" actually goes faster; I'll have to benchmark it. – moodymudskipper Nov 07 '17 at 16:07
  • Did you mean `warning` when typing `warn`? If so, you can skip the `paste` call with some minor changes, as `warning` behaves like `paste0`. If not, you should mention which package you take it from. I also suggest you replace `magrittr::extract2('n')` with `pull(n)`. – moodymudskipper Nov 08 '17 at 17:19
1

You can actually achieve your goal in one SQL query:

Add the row count (n) as an extra column to the data, using dplyr's mutate rather than summarise, and then set n < n_limit as a filter condition. This condition corresponds to a having clause in SQL. If the row count is larger than the limit, then no data are collected; otherwise all data are collected. You may wish to drop the row count column at the end.

This approach should work on most databases. I have verified this using PostgreSQL and Oracle.

copy_to(dest = con, cars, "cars")   # copy the built-in cars dataset to the database
df <- tbl(con, "cars")
n_limit <- 51
df %>% mutate(n = n()) %>% filter(n < n_limit) %>% collect()

However, it does not work on SQLite. To see why this is the case, you can check the SQL statement generated by the dplyr code:

df %>% mutate(n = n()) %>% filter(n < n_limit) %>% show_query()

<SQL>
SELECT *
FROM (SELECT "speed", "dist", COUNT(*) OVER () AS "n"
FROM "cars") "rdipjouqeu"
WHERE ("n" < 51.0)

The SQL contains a window function (count(*) over ()), which is not supported by SQLite.

Till
0

So, you can't check the size of the results without running the query.

Now the question is either to cache the results server side and test their size, or simply to put some "insurance" on the R side so that we never receive too many rows.

In the latter case, how about simply:

small_t <- con %>% tbl("small_table") %>%
  filter_group_etc %>%
  head(n=5e6) %>%
  collect()

If you get 5e6 rows, then you probably overflowed; we can't distinguish overflow from exactly 5e6 rows, but that seems a small price to pay to obtain single execution in the DB. Set 5e6 to 5000001 if you're really worried. (And 5000000L or 5000001L would be better options so that they are seen by the DB as integers.)
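
A minimal sketch of that variant, fetching one extra row so that an overflow is detectable (filter_group_etc is the placeholder from the question):

n_max <- 5000000L

maybe_big <- con %>% tbl("big_table") %>%
  filter_group_etc %>%
  head(n = n_max + 1L) %>%    # one extra row makes an overflow detectable
  collect()

if (nrow(maybe_big) > n_max) {
  stop("query returned more than ", n_max, " rows")
}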

This doesn't work so well if you're worried about a slow connection, but if you're simply worried about overflowing memory in R, it's a cheap piece of insurance that puts no extra load on the server.

dsz
  • But you're fetching these 5e6 rows when you just want to know that you overflowed. In some configurations, including mine, fetching takes a lot of time, often much more than executing the query on the server side. The chosen solution neatly avoids this. – moodymudskipper May 28 '18 at 06:25
0

You can also use the slice_sample() function.

collected_data <- table %>%    # table is assumed to be a lazy tbl, e.g. tbl(con, "big_table")
  slice_sample(n = 30) %>% 
  collect()