
I'm getting an error when reading a large table into R from SQL Server.

Here is my connection code:

library(odbc)
library(DBI)
con <- dbConnect(odbc::odbc(), 
     .connection_string = 'driver={SQL Server};server=DW01;database=SFAF_DW;trusted_connection=true')

Here is a schema of my table which has 149 variables:

(screenshot of the table schema)

data1 <- dbGetQuery(con, "SELECT * FROM [eCW].[Visits]")

This code raised an error, probably because the table is large.

I would like to reduce the number of observations by filtering on the "VisitDateTime" variable.

data2 <- dbGetQuery(con, "SELECT CAST(VisitDateTime AS DATETIME) AS VisitDateTime
                          FROM [eCW].[Visits]
                          WHERE VisitDateTime >= '2019-07-01 00:00:00'
                            AND VisitDateTime <= '2020-06-30 12:00:00'")

This code selected only the "VisitDateTime" variable, but I would like to get all 149 variables from the table.

I'm hoping for some efficient code. I greatly appreciate your help on this. Thank you.

DBhatta
  • `select * from ...` – jarlh Dec 01 '20 at 20:37
  • *"I got an error"* could be anything. It's about as specific as saying "I have a rash" and not including anything like location, color, purulence, or any other factor that might help a medical professional in identifying a possible cause and possible treatments. Please make this question more reproducible; refs: https://stackoverflow.com/q/5963269, [mcve], and https://stackoverflow.com/tags/r/info. – r2evans Dec 01 '20 at 20:38
  • Which is it, [tag:sql-server] or [tag:mysql]? It can't be both. (I don't think it matters *which* DBMS you are using, frankly, but please be precise and accurate in your question.) – r2evans Dec 01 '20 at 20:41
  • Thank you @jarlh, could you please provide more detailed code based on my scenario. It will be more helpful. – DBhatta Dec 01 '20 at 20:45
  • Thanks @r2evans. This is SQL Server. You can see my question at the end, which is "This code (data2) selected only "VisitDateTime" variable but I would like to get all (149 variables) from the table." – DBhatta Dec 01 '20 at 20:51
  • The code you provided at the end doesn't suggest any DBMS, it is only pseudo-code, and doesn't help us to fill in the blanks in your question. The error could be due to: your OS, your R, your network, the DBMS, the table name, the database/schema (`[eCW]`). Some of which might be controlled in R. Some of which might be bad luck timing, fixable by going to get a coffee and trying again. Your error could be *anything*. I don't think it has to do with 150 columns. It might have something to do with "large fields", a bug/feature Microsoft has imposed on ODBC connections (that no other DBMS does). – r2evans Dec 01 '20 at 20:55
  • Much of that ambiguity may be reduced if you give us the literal text of the error associated with your `"SELECT * FROM ..."` query. – r2evans Dec 01 '20 at 20:58
  • 1
    @r2evans, see OP's [original question](https://stackoverflow.com/q/65014679/1422451) where this stems from. OP is running into this known DBI error: [R DBI ODBC error: nanodbc/nanodbc.cpp:3...](https://stackoverflow.com/questions/45001152/r-dbi-odbc-error-nanodbc-nanodbc-cpp3110-07009-microsoftodbc-driver-13-fo). I advised OP to explicitly name columns in `SELECT` and not `*` and place larger text types to the *end* of `SELECT` as above SO post suggests. – Parfait Dec 01 '20 at 21:06
  • Thank you @Parfait for the clarification and your help. – DBhatta Dec 01 '20 at 21:52
  • That's the error I was suggesting with "large fields", thanks for the clarity @Parfait. – r2evans Dec 01 '20 at 22:38

1 Answer


According to your schema, you have many variable-length types (varchar, of up to 255 characters). As multiple answers on the similar error post suggest, you cannot rely on the arbitrary order of SELECT * but must explicitly reference each column and place the variable-length types toward the end of the SELECT clause. In fact, in application code running SQL you should generally avoid SELECT * FROM. See Why is SELECT * considered harmful?

Fortunately, using INFORMATION_SCHEMA.COLUMNS you can dynamically build such a named column list for the SELECT. First, adjust and run your schema query as an R data frame with a calculated column that orders the columns from smallest to largest types by their precision/length.

schema_sql <- "SELECT sub.TABLE_NAME, sub.COLUMN_NAME, sub.DATA_TYPE, sub.SELECT_TYPE_ORDER
                    , sub.CHARACTER_MAXIMUM_LENGTH, sub.CHARACTER_OCTET_LENGTH
                    , sub.NUMERIC_PRECISION, sub.NUMERIC_PRECISION_RADIX, sub.NUMERIC_SCALE
               FROM 
                  (SELECT TABLE_NAME, COLUMN_NAME, DATA_TYPE 
                        , CHARACTER_MAXIMUM_LENGTH, CHARACTER_OCTET_LENGTH
                        , NUMERIC_PRECISION, NUMERIC_PRECISION_RADIX, NUMERIC_SCALE
                        , CASE DATA_TYPE
                                WHEN 'tinyint'   THEN 1
                                WHEN 'smallint'  THEN 2
                                WHEN 'int'       THEN 3
                                WHEN 'bigint'    THEN 4
                                WHEN 'date'      THEN 5
                                WHEN 'datetime'  THEN 6
                                WHEN 'datetime2' THEN 7
                                WHEN 'decimal'   THEN 8
                                WHEN 'varchar'   THEN 9
                                WHEN 'nvarchar'  THEN 10
                          END AS SELECT_TYPE_ORDER
                   FROM INFORMATION_SCHEMA.COLUMNS
                   WHERE TABLE_SCHEMA = 'eCW'
                     AND TABLE_NAME = 'Visits'
                  ) sub
               ORDER BY sub.SELECT_TYPE_ORDER
                      , sub.NUMERIC_PRECISION
                      , sub.NUMERIC_PRECISION_RADIX
                      , sub.NUMERIC_SCALE
                      , sub.CHARACTER_MAXIMUM_LENGTH
                      , sub.CHARACTER_OCTET_LENGTH"

visits_schema_df <- dbGetQuery(con, schema_sql)

# BUILD COLUMN LIST FOR SELECT CLAUSE
select_columns <- paste0("[", paste(visits_schema_df$COLUMN_NAME, collapse="], ["), "]")

# RUN QUERY WITH EXPLICIT COLUMNS
data <- dbGetQuery(con, paste("SELECT", select_columns, "FROM [eCW].[Visits]"))

The above may need adjustment if the same error arises. Be proactive and test on your end by isolating the problem columns, column types, etc. A few suggestions include filtering on DATA_TYPE or COLUMN_NAME, or reordering the ORDER BY columns in the schema query.

...
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'eCW'
  AND TABLE_NAME = 'Visits'
  AND DATA_TYPE IN ('tinyint', 'smallint', 'int')  -- TEST WITH ONLY INTEGER TYPES
...
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'eCW'
  AND TABLE_NAME = 'Visits'
  AND NOT DATA_TYPE IN ('varchar', 'nvarchar')     -- TEST WITHOUT VARIABLE STRING TYPES
...
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'eCW'
  AND TABLE_NAME = 'Visits'
  AND NOT DATA_TYPE IN ('decimal', 'datetime2')    -- TEST WITHOUT HIGH PRECISION TYPES
...
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'eCW'
  AND TABLE_NAME = 'Visits'
  AND NOT COLUMN_NAME IN ('LastHIVTestResult')     -- TEST WITHOUT LARGE VARCHARs
...
ORDER BY sub.SELECT_TYPE_ORDER                         -- ADJUST ORDERING
       , sub.NUMERIC_SCALE                             
       , sub.NUMERIC_PRECISION
       , sub.NUMERIC_PRECISION_RADIX
       , sub.CHARACTER_OCTET_LENGTH
       , sub.CHARACTER_MAXIMUM_LENGTH
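
If the oversized varchar columns turn out to be the culprit, another workaround is to shrink them with CAST when building the SELECT list. Below is a sketch: `build_select_list` and the cap of 100 are my own illustration (not from the question), and it assumes the `visits_schema_df` data frame returned by the schema query above, with its COLUMN_NAME and DATA_TYPE columns.

```r
# Build a SELECT list that caps varchar/nvarchar columns at a given length,
# leaving all other columns untouched. schema_df must have COLUMN_NAME and
# DATA_TYPE columns, as returned by the INFORMATION_SCHEMA query above.
build_select_list <- function(schema_df, varchar_cap = 100) {
  ifelse(
    schema_df$DATA_TYPE %in% c("varchar", "nvarchar"),
    sprintf("CAST([%s] AS VARCHAR(%d)) AS [%s]",
            schema_df$COLUMN_NAME, varchar_cap, schema_df$COLUMN_NAME),
    sprintf("[%s]", schema_df$COLUMN_NAME)
  )
}

# select_list <- paste(build_select_list(visits_schema_df), collapse = ", ")
# data <- dbGetQuery(con, paste("SELECT", select_list, "FROM [eCW].[Visits]"))
```

The cap is a placeholder; find the real maximum first with something like `SELECT MAX(LEN(LastHIVTestResult)) FROM [eCW].[Visits]` and cast to that length instead.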

Still another solution is to stitch the R data frame together by type groups (adjusting the schema query accordingly), using a chain merge on the primary key (assumed to be DW_Id):

final_data <- Reduce(function(x, y) merge(x, y, by="DW_Id"),
                     list(data_int_columns,        # SEPARATE QUERY RESULT WITH DW_Id AND INTs IN SELECT
                          data_num_columns,        # SEPARATE QUERY RESULT WITH DW_Id AND DECIMALs IN SELECT 
                          data_dt_columns,         # SEPARATE QUERY RESULT WITH DW_Id AND DATE/TIMEs IN SELECT
                          data_char_columns)       # SEPARATE QUERY RESULT WITH DW_Id AND VARCHARs IN SELECT
              )
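
The per-type queries for that chain merge can be generated from the same schema data frame. This is a sketch under assumptions: `build_type_queries` and its type groupings are my own illustration, the table name comes from the question, and `visits_schema_df` is the schema query result from above.

```r
# Build one SELECT statement per broad type group, always including the
# DW_Id key column so the separate results can be merged back together.
build_type_queries <- function(schema_df, key = "DW_Id") {
  type_groups <- list(
    ints  = c("tinyint", "smallint", "int", "bigint"),
    nums  = c("decimal"),
    dts   = c("date", "datetime", "datetime2"),
    chars = c("varchar", "nvarchar")
  )
  lapply(type_groups, function(types) {
    cols <- schema_df$COLUMN_NAME[schema_df$DATA_TYPE %in% types]
    cols <- union(key, cols)                      # key always first
    sprintf("SELECT %s FROM [eCW].[Visits]",
            paste0("[", cols, "]", collapse = ", "))
  })
}

# queries    <- build_type_queries(visits_schema_df)
# results    <- lapply(queries, function(q) dbGetQuery(con, q))
# final_data <- Reduce(function(x, y) merge(x, y, by = "DW_Id"), results)
```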
Parfait
  • I used the second option `AND NOT DATA_TYPE IN ('varchar')` and it works, so there is a problem with the `varchar` data type. I tried `AND DATA_TYPE IN ('varchar')` and it did not work; I got the same error. – DBhatta Dec 02 '20 at 01:25
  • Good. We now isolated which types are the issue. Now create a schema dataset of just `varchar`. Have R dump out the column names and then one by one add them to final `SELECT` (at the end) and see which ones raises the error. I am seeing you have one column, `LastHIVTestResult` at 1,024 character length. Try running all but this one to test. If no error, find the maximum length of this field with data: `SELECT MAX(LEN(LastHIVTestResult)) AS Max_Length FROM [ecW].[Visits];` Then pass number in final query `LEFT(LastHIVTestResult, ###)` or cast/covert to `VARCHAR(###)` on this field. – Parfait Dec 02 '20 at 02:37
  • Not sure what `102variables` is but as you run it, this would be an SQL error. In R you need to call `paste` or `sprintf` to combine string variables together. Possibly in generating this object, you removed spaces in names. See edit to test if that large varchar alone causes error which involves adding a `WHERE` condition to schema dataset: `AND COLUMN_NAME NOT IN ('LastHIVTestResult', ...)`. – Parfait Dec 02 '20 at 21:13
  • Sorry `102variables` means all 102 different variables. – DBhatta Dec 02 '20 at 21:28
  • I wrote column names inside the [...] and added them one by one to the final `SELECT` (at the end), like `dbGetQuery(con, "SELECT [ID], [Heard aobut] FROM [eCW].[Visits]")`, and it worked, but half of the `varchar` variables conflict with the `VisitDateTime` variable. – DBhatta Dec 08 '20 at 16:50
  • Try `CAST` or `CONVERT` for `datetime2` to the simpler `datetime`. For `varchar`, find the max length of the existing data and `CAST` to a smaller `varchar` in `SELECT`. I don't know your data, but ask yourself: do values really run to 255 characters? If it is still a challenge, see the very last solution, merging an R data frame together from separate query results. You really have to play around with the solution; there is no single one-liner for your issue. Finally, you could have SQL Server dump the data to text and read that into R. Good luck! – Parfait Dec 08 '20 at 18:04
  • Thank you @Parfait – DBhatta Dec 08 '20 at 22:25