
I'm trying to analyze data stored in an SQL database (MS SQL Server) in R, on a Mac. Typical queries might return a few GB of data, and the entire database is a few TB. So far, I've been using the R package odbc, and it seems to work pretty well.

However, dbFetch() seems really slow. For example, a somewhat complex query returns all results in ~6 minutes in SQL Server, but if I run it with odbc and then try dbFetch(), it takes close to an hour to get the full 4 GB into a data.frame. I've tried fetching in chunks, which helps modestly (https://stackoverflow.com/a/59220710/8400969). I'm wondering if there is another way to pipe the data to my Mac more quickly, and I like the line of thinking in "Quickly reading very large tables as dataframes".
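For reference, the chunk-wise fetch I tried looks roughly like this (the DSN, query, and chunk size are placeholders, not my real ones):

library(DBI)
library(odbc)

con <- dbConnect(odbc::odbc(), dsn = "my_dsn")           # placeholder DSN
res <- dbSendQuery(con, "SELECT * FROM dbo.big_table")   # placeholder query

chunks <- list()
while (!dbHasCompleted(res)) {
  # pull 100k rows at a time instead of everything at once
  chunks[[length(chunks) + 1]] <- dbFetch(res, n = 100000)
}
dat <- do.call(rbind, chunks)   # or data.table::rbindlist(chunks)

dbClearResult(res)
dbDisconnect(con)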

What are some strategies for speeding up dbFetch when the results of queries are a few GB of data? If the issue is generating a data.frame object from larger tables, are there savings available by "fetching" in a different manner? Are there other packages that might help?

Thanks for your ideas and suggestions!

  • I was just reading in large tables with `data.table::fread` and `read.csv`. `fread(my.csv, data.table=F)` beats out `fread(my.csv)`, and both take 1/10 as long as `read.csv`. I don't yet know much about how `fread` works but is there an equivalent method for fetching query results? – Michael Roswell Mar 18 '20 at 18:26
  • As I learn more about `fread`, wondering if there is a way to `dbFetch` by `mmap`ing the query results for fast read in. – Michael Roswell Mar 19 '20 at 12:40
  • Did you also try out `dbGetQuery`? Do the queries have to be on the fly? Otherwise you could preprocess some data and store it in an `.rda` and read that. Or maybe you could write the query result to a .csv file first and then use `fread` – SeGa Mar 23 '20 at 08:47
  • @SeGa, `dbGetQuery` is just as bad... but could you spell out your two recommendations a bit more for me? Those both look promising. I think for my purposes (data exploration, visualization, some cleaning, of data collected by others in a database built by others) being able to do things on the fly is pretty important for now. – Michael Roswell Mar 23 '20 at 13:10
  • You might be better off running the query in SSMS with "Results to Text". That will create a csv flat file - try fread on that csv? – Russell Fox Mar 24 '20 at 21:22
  • Following @RussellFox's direction, maybe try a dedicated DB client to see the time; if it still takes a few hours, there is little hope it can be better in R. I use a C++ ODBC implementation from time to time to extract data (also several GB), and it is incredibly slow... – Chelmy88 Mar 25 '20 at 09:50
  • @RussellFox, is there a good SSMS alternative for Mac? FWIW, the query runs locally on the server using SSMS in a couple of minutes; I think the slow part is getting the results of the query into R. Thanks for the ideas! I'm experimenting today! – Michael Roswell Mar 25 '20 at 14:14
  • It's not a good solution and it does seem like there's a deeper problem at its root, but you might try using the `future` package to read data in chunks in parallel to speed it up in the short term – Roger-123 Mar 25 '20 at 14:43
  • Azure Data Studio is available for mac: https://github.com/Microsoft/azuredatastudio. Or you might be able to use the BCP command line tools on the server: "bcp "SELECT TOP 5 [BusinessEntityID],[NationalIDNumber],[OrganizationNode],[OrganizationLevel] FROM AdventureWorks2016CTP3.[HumanResources].[Employee] WITH (NOLOCK)" queryout c:\sql\bcp.txt -c -T" – Russell Fox Mar 25 '20 at 15:50
  • https://www.sqlservercentral.com/articles/8-ways-to-export-sql-results-to-a-text-file – Russell Fox Mar 25 '20 at 15:50
  • Truly an interesting problem. As for a UI alternative, [dbeaver](https://dbeaver.io/download/) can be used ([export example](https://stackoverflow.com/a/47515548/10782538)). As for how to change the format for `dbFetch`, one would have to rewrite the source `C++` code, as this is where most of the package code resides. Similarly for `DBI`, which seems to simply call the methods from `odbc`. – Oliver Mar 26 '20 at 08:13
  • I believe @SeGa's comment re: "on the fly" was suggesting something like this workflow. In SQL, `CREATE TABLE TEMP AS SELECT LONG_COMPLEX_QUERY FROM TABLE` and then `dbFetch()` the pre-computed table? That's a go-to workflow for me, as reading in the pre-computed table executes much faster. – Chase May 09 '20 at 14:33
  • I found this helps in some situations: https://db.rstudio.com/pool/ – BSCowboy Jan 08 '21 at 23:07
  • I have been running into similar problems on my projects, and I found the solution usually to be to *reduce the amount of data transferred*. There's a reason large data tables reside on databases. The most user-friendly package I've encountered is [dbplyr](https://db.rstudio.com/r-packages/dplyr), which allows `dplyr`-style query construction that only gets evaluated when needed (a rough sketch of this approach follows these comments). Can you provide some details on the use cases of your GB+ datasets? – All Downhill From Here Aug 22 '21 at 09:29
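A rough sketch of the dbplyr approach mentioned in the last comment (the DSN, table, and column names are placeholders only): build the query lazily, let SQL Server do the filtering and aggregation, and only collect() the reduced result.

library(DBI)
library(odbc)
library(dplyr)
library(dbplyr)

con <- dbConnect(odbc::odbc(), dsn = "my_dsn")         # placeholder DSN

# Lazy reference to the table; nothing is fetched yet
big_tbl <- tbl(con, in_schema("dbo", "big_table"))     # placeholder schema/table

# dplyr verbs are translated to SQL and executed on the server;
# only the reduced result crosses the network at collect()
summary_df <- big_tbl %>%
  filter(collection_id == 2498) %>%                    # placeholder filter
  count(school_id) %>%                                 # placeholder grouping column
  collect()

dbDisconnect(con)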

2 Answers


My answer uses a different package: RODBC, available on CRAN at https://cran.r-project.org/web/packages/RODBC/index.html.

This has saved me SO MUCH frustration and wasted time compared with my previous method of exporting each query result to .csv and loading it into my R environment. I found the odbc package to be much slower than RODBC for my queries. I use the following functions:

sqlQuery() takes the function that opens the connection to the SQL db as its first argument (in parentheses) and the query itself as its second argument. Put the query in quote marks.

odbcConnect() is itself the first argument to sqlQuery(). Its argument is the name of your ODBC connection to the SQL db. Put the connection name in quote marks.

odbcCloseAll() is the final function for this task set. Use this after each sqlQuery() to close the connection and save yourself from annoying warning messages.

Here is a simple example.

library(RODBC)
result <- sqlQuery(odbcConnect("ODBCConnectionName"),
           "SELECT * 
           FROM dbo.table 
           WHERE Collection_ID = 2498") 

odbcCloseAll()
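A small variant of the same call that keeps the connection handle, so it can be reused for several queries and closed on its own (the connection name and queries are placeholders):

library(RODBC)

ch <- odbcConnect("ODBCConnectionName")   # open once, reuse for several queries

result <- sqlQuery(ch, "SELECT * FROM dbo.table WHERE Collection_ID = 2498")
counts <- sqlQuery(ch, "SELECT COUNT(*) AS n FROM dbo.table")

odbcClose(ch)   # closes just this channel; odbcCloseAll() closes every open channel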

Here is the same example PLUS data manipulation directly from the query result.

    library(dplyr)
    library(RODBC)

    result <- sqlQuery(odbcConnect("ODBCConnectionName"),
                       "SELECT * 
                        FROM dbo.table 
                        WHERE Collection_ID = 2498") %>% 
        mutate(matchid = paste0(schoolID, "-", studentID)) %>% 
        distinct(matchid, .keep_all = TRUE)

    odbcCloseAll()

I would suggest using the dbcooper package, found on GitHub: https://github.com/chriscardillo/dbcooper

I have found huge improvements in speed when querying large datasets.

First, add your connection to your environment.

conn <- DBI::dbConnect(odbc::odbc(),
                       Driver   = "",   # e.g. "ODBC Driver 17 for SQL Server", depending on what is installed
                       Server   = "",
                       Database = "",
                       UID      = "",
                       PWD      = "")

devtools::install_github("chriscardillo/dbcooper")
library(dbcooper)
dbcooper::dbc_init(con    = conn,
                   con_id = "test",
                   tables = c("schema.table"))

This adds the function test_schema_table() to your environment, which is used to call the data. To collect it into your environment, use test_schema_table() %>% collect().
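If the generated accessor returns a lazy dbplyr table (as it appears to), filters can be pushed to the server before collecting, so only the rows you need are transferred. A rough sketch, assuming the test_schema_table() accessor created above and a placeholder column name:

library(dplyr)

# Filter on the database side, then pull only the matching rows into R
small_df <- test_schema_table() %>%
  filter(some_column == 1) %>%   # placeholder column and value
  collect()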

Here is a microbenchmark I did to compare the results of both DBI and dbcooper.

mbm <- microbenchmark::microbenchmark(
  DBI = DBI::dbFetch(DBI::dbSendQuery(conn, qry)),
  dbcooper = ava_qry() %>% collect(),
  times = 5
)

Here are the results:

[Figure: microbenchmark of DBI vs dbcooper]
