1: Is there a quicker way to copy data from SQL Server to R?
The answer here is rather simple. The odbc package in R does quite a bit under the hood to ensure compatibility between the result fetched from the server and R's data structures. It might be possible to obtain a slight increase in speed by using an alternative package (RODBC is an older package, and it sometimes seems faster). In this case however, with 44 million rows, I expect that the bigger performance boost comes from preparing your SQL statement. The general idea is to:
- Remove any unnecessary columns. Remember that each column has to be downloaded, so if you have 20 columns, removing 1 column may reduce the time needed to fetch your data by roughly 5 % (assuming linear run-time).
- If you plan on performing aggregation, it will almost always be faster to do this directly in your query. For example, if you have a column called Ticker and a column called Volume and you want the average value of Volume, you can calculate this directly in your query. Similarly, the last row per group can be obtained with
last_value(colname) over ([partition by [grouping col 1], [grouping col 2], ...] order by [order col 1], [order col 2]) as last_value_colname
If you choose to do this, it might be beneficial to test your query on a small subset of rows first, using TOP N (SQL Server) or LIMIT N (databases that support LIMIT), e.g.
select top 100 [columns] from mytable_in_sql order by [order col]
which would only return the first 100 rows. As Martin Schmelzer commented, this can also be done from R with the dplyr::tbl function, however it is generally faster to refine the SQL statement itself (a small R sketch of this is shown after this list).
Finally, if your query becomes more complex (which does not seem to be the case here), it might be beneficial to create a view on the table (CREATE VIEW ... AS [select statement]) and query this view instead. The server will then try to optimize the query, and if your bottleneck is on the server side rather than the local side, this can improve performance.
- Finally, one must state the obvious: as noted above, when you query the server you are downloading some (maybe quite a lot of) data. This can be improved by improving your internet connection, either by repositioning your computer or router, by connecting directly via a cable, or simply by upgrading your internet plan. For 44 million rows, if you have only a single 64-bit double-precision variable, you have
44 * 10^6 * 8 / 1024^3 ≈ 0.33
GiB of data (uncompressed). If you have 10 such columns, this goes up to roughly 3.3 GiB. It is simply going to take quite a while to download all of this data, so getting the row count down would be extremely helpful!
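To make the points above concrete, here is a minimal sketch of how the prepared statements could be run from R with DBI and odbc. The driver, server and database names are placeholders, and Ticker, Volume and MyTable_in_SQL are just the example names used above, so adjust them to your schema.

    library(DBI)

    # Connect via odbc (all connection details below are placeholders).
    con <- dbConnect(
      odbc::odbc(),
      Driver             = "ODBC Driver 17 for SQL Server",
      Server             = "my-server",
      Database           = "my-database",
      Trusted_Connection = "Yes"
    )

    # Test the statement on a small subset first (TOP 100 on SQL Server).
    test <- dbGetQuery(con, "
      SELECT TOP 100 Ticker, Volume
      FROM MyTable_in_SQL
      ORDER BY Ticker
    ")

    # Let the server do the aggregation instead of downloading all 44 mil. rows.
    avg_volume <- dbGetQuery(con, "
      SELECT Ticker, AVG(Volume) AS avg_volume
      FROM MyTable_in_SQL
      GROUP BY Ticker
    ")

If you prefer dplyr, dplyr::tbl(con, "MyTable_in_SQL") together with dbplyr will generate comparable SQL from your select()/summarise() calls and only fetch the result when you call collect(), but writing the statement yourself gives you full control over what the server has to do.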
As a side note, you might be able to download the table slightly faster directly via SSMS (still slow due to the table size) and then import the file locally. For the fastest transfer you would likely have to look into the bulk import and export functionality of SQL Server.
2: If I make any changes to this data in R does it change anything in my MyTable_in_SQL?
No: R holds no internal pointer/connection to the table once it has been loaded. I don't believe a package exists (in R at least) that opens a stream to the table and could dynamically update it. I know that functionality like this exists in Excel, but even there it has some dangerous side effects and should (in my opinion) only be used in read-only applications, where the user wants to see an (almost) live stream of the data.
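To illustrate, here is a minimal sketch (reusing the hypothetical connection con and the example Ticker/Volume columns from above): edits to the local data.frame stay local, and pushing anything back to the server always requires an explicit write such as DBI::dbWriteTable.

    library(DBI)

    # df is just a local copy; MyTable_in_SQL is not affected by anything below.
    df <- dbGetQuery(con, "SELECT TOP 100 Ticker, Volume FROM MyTable_in_SQL")
    df$Volume <- df$Volume * 2   # changes only the in-memory data.frame

    # Writing back is always an explicit, separate step, here into a
    # hypothetical copy table rather than the original.
    dbWriteTable(con, "MyTable_copy_from_R", df, overwrite = TRUE)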
3: How to avoid going through this step every time I open R? Is there a way to save my data.frame in the "background" in R?
To avoid this, simply save the table after every session. Whenever you close RStudio it will ask whether you want to save your current session; if you click yes, it saves .Rhistory and .RData in the getwd() directory, and these are loaded again the next time you open a session (unless you changed your working directory before closing, using setwd(...)).
However, I highly suggest you do not do this for larger datasets, as it will make your R session take forever to open the next time you start R, and it may create unnecessary copies of your data (for example, if you import the table into df and store a transformation in df2, you suddenly have two copies of a multi-GiB dataset to load every time you open R). Instead I highly suggest saving the file with arrow::write_parquet(df, file_path), which is a much (and I mean MUCH!) faster alternative to saving it as an RDS or csv file. Parquet files cannot be opened as easily in Excel, but they can be read in R with arrow::read_parquet and in Python with pandas.read_parquet or pyarrow.parquet.read_parquet, while being compressed to a size that is usually 50-80 % smaller than the equivalent csv file.
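As a sketch of that workflow (the file name and column names are again placeholders):

    library(arrow)

    # End of session: write the fetched data.frame to disk once.
    write_parquet(df, "mytable.parquet")

    # Next session: read it back instead of querying the server again.
    df <- read_parquet("mytable.parquet")

    # Parquet is columnar, so you can also read just the columns you need.
    df_small <- read_parquet("mytable.parquet", col_select = c("Ticker", "Volume"))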
Note:
If you already saved your R session after loading in the file, and you experience a very slow startup, I suggest removing the .RData file from your working directory. This is usually the Documents folder (C:/Users/[user]/Documents) on your system.
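If you are unsure where that file ended up, a small sketch for finding and removing it from within R (double-check the path before deleting anything):

    # Delete the saved workspace in the current working directory, if present.
    rdata_path <- file.path(getwd(), ".RData")
    if (file.exists(rdata_path)) file.remove(rdata_path)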