
I have successfully set up an R SQL Server ODBC connection by doing:

library(DBI)
library(odbc)

DBI_connection <- dbConnect(odbc(),
                            driver = "SQL Server",
                            server = server_name,
                            database = database_name)

Dataset_in_R <- dbFetch(dbSendQuery(DBI_connection,
                                    "SELECT * FROM MyTable_in_SQL"))

3 quick questions:

1-Is there a quicker way to copy data from SQL Server to R? This table has 44+ million rows and the query is still running...

2-If I make any changes to this data in R, does it change anything in my MyTable_in_SQL? I don't think so, because I have saved it in a global data.frame variable in R, but just checking.

3-How to avoid going through this step every time I open R? Is there a way to save my data.frame in the "background" in R?

  • Concerning 1: Do you always need all the rows in your R session? Otherwise you can try using `dplyr::tbl(DBI_connection, "MyTable_in_SQL")`. Executing this will not fetch all rows. Instead you get a snapshot of the table in R and you can further build your query using `dplyr`'s data wrangling commands. When you have the data you finally need, you can call `collect()` and the data gets fetched from the server. – Martin Schmelzer Jul 16 '20 at 13:11
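
A minimal sketch of that lazy-query pattern, using the connection object from the question and the hypothetical Ticker/Volume columns from the answer below (dbplyr must be installed for dplyr to translate the pipeline to SQL):

library(dplyr)   # dbplyr translates the pipeline to SQL behind the scenes

# Lazy reference to the table: no rows are fetched yet
lazy_tbl <- tbl(DBI_connection, "MyTable_in_SQL")

# Build the query with dplyr verbs; still nothing is downloaded
query <- lazy_tbl %>%
  group_by(Ticker) %>%
  summarise(avg_volume = mean(Volume, na.rm = TRUE))

# Only now is the (much smaller) result pulled into R
local_df <- collect(query)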

2 Answers


1: Is there a quicker way to copy data from SQL Server to R?

The answer here is rather simple. The odbc package in R does quite a bit under the hood to ensure compatibility between the result fetched from the server and R's data structures. It might be possible to obtain a slight increase in speed by using an alternative package (RODBC is an older package, and it sometimes seems faster). In this case however, with 44 million rows, I expect the bigger performance boost to come from preparing your SQL statement. The general idea would be to

  1. Remove any unnecessary columns. Remember each column will need to be downloaded, so if you have 20 columns, removing 1 column may reduce your query execution time by ~5% (assuming linear run-time)
  2. If you plan on performing aggregation, it will (almost always) be faster to perform this directly in your query, e.g. if you have a column called Ticker and a column called Volume and you want the average value of Volume, you can calculate this directly in your query. Similarly for the last row, using last_value(colname) over ([partition by [grouping col 1], [grouping col 2] ...] order by [order col 1], [order col 2]) as last_value_colname.
    If you choose to do this, it can be beneficial to test your query on a small subset of rows first. Note that SQL Server uses TOP N rather than LIMIT N, e.g. select top 100 [columns] from mytable_in_sql order by [order col], which only returns the first 100 rows. As Martin Schmelzer commented, this can also be done from R with the dplyr::tbl function, although hand-writing the statement gives you full control over the SQL that is sent (see the sketch after this list).
    Finally, if your query becomes more complex (which does not seem to be the case here), it might be beneficial to create a view on the table (CREATE VIEW with the specific select statement) and query this view instead. The server will then try to optimize the query, and if your bottleneck is on the server side rather than the local side, this can improve performance.
  3. Finally, one must state the obvious: as noted above, when you query the server you are downloading some (maybe quite a lot of) data. This can be improved by improving your internet connection, either by repositioning your computer or router, connecting directly via a cable, or simply upgrading your internet connection. To get a sense of scale: 44 million rows of a single 64-bit double-precision column is roughly 44 * 10^6 * 8 bytes / 1024^3 ≈ 0.33 GiB of raw (uncompressed) data, and with 10 such columns that grows to roughly 3.3 GiB. It is simply going to take a while to download all of this, so getting the row count down would be extremely helpful!
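
As a rough sketch of points 1 and 2, using the hypothetical Ticker and Volume columns from above (substitute your real column names): DBI::dbGetQuery() sends the statement and fetches the result in one call.

library(DBI)

# Test the statement on a small subset first: SQL Server uses TOP, not LIMIT
preview <- dbGetQuery(DBI_connection, "
  SELECT TOP 100 Ticker, Volume
  FROM MyTable_in_SQL
")

# Let the server do the aggregation and only download the (much smaller) result
avg_volume_per_ticker <- dbGetQuery(DBI_connection, "
  SELECT Ticker, AVG(Volume) AS avg_volume
  FROM MyTable_in_SQL
  GROUP BY Ticker
")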

As a side note, you might be able to download the table slightly faster directly via SSMS (still slow due to the table size) and then import the file locally. For the fastest transfer you will likely have to look into the bulk import and export functionality of SQL Server.
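
If you go the export-then-import route, this is roughly how the exported file could be read back into R (the file path is hypothetical, and the data.table package is just one of several fast readers):

library(data.table)

# Hypothetical path to a CSV exported from SSMS or the bulk export tools
exported_file <- "C:/exports/MyTable_in_SQL.csv"

# fread() is typically much faster than read.csv() for files of this size
Dataset_in_R <- fread(exported_file)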

2: If I make any changes to this data in R does it change anything in my MyTable_in_SQL?

No: R has no internal pointer/connection to the table once it has been loaded. I am not even aware of an R package that opens a stream to the table and could dynamically update it. I know that functionality like this exists in Excel, but even there it has some dangerous side effects and should (in my opinion) only be used in read-only applications, where the user wants to see an (almost) live stream of the data.
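
As a small illustration (a sketch only, reusing the example Volume column from above; the doubling and the new table name are made up): modifying the local data.frame never touches the server, and writing changes back is always an explicit, separate step.

# This only changes the copy held in your R session; MyTable_in_SQL is untouched
Dataset_in_R$Volume <- Dataset_in_R$Volume * 2

# Pushing data back to the server always requires an explicit call, e.g.
DBI::dbWriteTable(DBI_connection, "MyTable_in_SQL_modified", Dataset_in_R)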

3: How to avoid going through this step every time I open R? Is there a way to save my data.frame in the "background" in R?

To avoid this, simply save the table at the end of a session. Whenever you close RStudio it will ask whether you want to save your current session; if you click yes, it saves .Rhistory and .RData in the getwd() directory, and these are loaded again the next time you open a session (unless you changed your working directory before closing the session using setwd(...)).

However, I highly suggest you do not do this for larger datasets, as it will cause your R session to take forever to open the next time, and it can create unnecessary copies of your data (for example, if you import the table into df and make a transformed copy in df2, you suddenly have two copies of a multi-gigabyte dataset to load every time you open R).

Instead, I highly suggest saving the file with arrow::write_parquet(df, file_path), which is a much (and I mean MUCH!!) faster alternative to saving it as an RDS or csv file. Parquet files can't be opened as easily in Excel, but they can be read in R with arrow::read_parquet and in Python with pandas.read_parquet or pyarrow.parquet.read_table, while being compressed to a size that is usually 50-80 % smaller than the equivalent csv file.
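
A minimal sketch of that workflow, assuming your data sits in a data.frame called df and using a hypothetical file path:

library(arrow)

# Save once at the end of a session (fast, compressed columnar format)
write_parquet(df, "C:/Users/[Username]/Documents/Dataset_in_R.parquet")

# In a later session, read it back without touching SQL Server
df <- read_parquet("C:/Users/[Username]/Documents/Dataset_in_R.parquet")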

Note:

If you already saved your R session after loading the file and you experience a very slow startup, I suggest removing the .RData file from your working directory. This is usually your Documents folder (C:/Users/[user]/Documents).

Oliver
  • Thanks very much for your input on this - very much appreciated. Can you just clarify the difference between doing arrow::write_parquet(df, file_path) and save.image('path/image_name.Rdata')? – Daniela Rodrigues Jul 16 '20 at 16:46
  • Exactly what the low-level difference is, is beyond me. Both (to the best of my knowledge) save the files in some binary format, but `arrow` was developed by Apache's development team and has been optimized with a focus on speed and storage space. I believe (but have not tested) that parquet files in general will be smaller in storage size, and I am certain that they will be between 20 - 300 times faster to save to disk and load into memory. – Oliver Jul 16 '20 at 17:02
  • As a small (not reproducible) example: in a project that I've just finished, I had to convert about 2076 files from CSV to parquet, with a total of roughly 396 million rows. The csv files had a total size of around 44 GB, while the parquet files only took up 12 GB. Only about 1/20th of the time was spent on saving the parquet files, while the remainder was spent reading the csv files. – Oliver Jul 16 '20 at 17:08
  • That's amazing Oliver. Thanks a lot for sharing all of this. Can you just show an example of the syntax I should use on the file_path argument of the arrow function? – Daniela Rodrigues Jul 16 '20 at 17:10
  • Glad I could help. Basically that is just the file name, for example `'C:/Users/[Username]/Desktop/data.parquet'`. If you read the `help(arrow::write_parquet)` page the argument is called `sink` with the description `an arrow::io::OutputStream or a string which is interpreted as a file path` :-) – Oliver Jul 16 '20 at 17:41
  • But where is that data.parquet? I don't have it in my folder. First I am trying to save my dataset df using: arrow::write_parquet(df, 'C:/User Folders/[Username]'). Do you mean, when I open it, I should do: arrow::read_parquet(df, 'C:/User Folders/[Username]/data.parquet')? Also, what's the name of the package to run arrow::write/read_parquet? – Daniela Rodrigues Jul 16 '20 at 18:24
  • It is similar to using `write_csv` or `write_table`. So the destination should be the file that you want to write to. So when you want to open it, you indeed need to use the full path for `read_parquet` again. :-) – Oliver Jul 16 '20 at 18:33
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/217987/discussion-between-daniela-rodrigues-and-oliver). – Daniela Rodrigues Jul 16 '20 at 18:34

On question 2 you're correct: any changes in R won't change anything in the DB.

About question 3, you can call save.image() or save.image('path/image_name.Rdata') to save your environment, and then recover it later in another session with load('path/image_name.Rdata').
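
A small sketch of that approach (the file name is just an example):

# At the end of a session: save every object in the global environment
save.image('path/image_name.Rdata')

# In a later session: restore everything
load('path/image_name.Rdata')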

Maybe with this you don't need a faster way to get data from a DB.

flopeko