1

I have a lot of data in a MSSQL database. I've written a query to get me the data that I need and would like to load that data into R to analyze.

Problem: When loading the data into R, my CustomerID numbers are being altered, i.e. CustomerID in SQL server will display as "xxxx0101" [I've truncated the CustomerID field with xxxx to show it's only be altered near the end] but in R it shows up as "xxxx2724" (quotes are not to indicate strings: datatype in SQL is varchar(30); R appears to auto-assign the datatype). In addition it will do rounding on my DataValues (SQL: 2.84....7366 R: 2.84.....6575)

SQL Table structure:

CustomerID  | DataValue
xxxxxxx0101 | 2.84...7366

In R the same info appears as:

CustomerID  | DataValue
xxxxxxx2724 | 2.84...6575

And I don't understand why R is changing the values of my fields.

R Code:

require(RODBC)
myScript = paste(gsub("\t", " ", readLines("TestQuery.sql")), collapse = " ")
myData = data.frame(sqlQuery(myDBConnection, myCode))
View(myData)

The paste() and gsub() functions were added to the main readLines() to get rid of stuff like \t (for tab) and to store my multi-line SQL query into one string. I do not believe that is causing the problem, I've pasted the result of myScript directly back into MSSQL server and verified that it returns the exact same result as TestQuery.sql

I'm using the RODBC package to connect my R and SQL. Not sure if it's the right package to be using, if there's a newer package that doesn't have this probelm. If I'm running a small test query something like:

data.frame(sqlQuery(myDBConnection, "select top 10 CustomerID from CustList 
order by CustomerID"))

I only have the problem with some of the CustomerIDs others come in correct.

Fun fact: my co-worker ran my SQL query, created a SAS data set from it, and then loaded the data set from SAS into R as a data frame and did not have the issues I've mentioned above (he used the Haven library in R to read in his SAS data set)

UPDATE: I attempted to first initialize an empty data frame of type character and then load in each CustomerID by one at a time using stringsAsFactors = FALSE. Code below:

foo = sqlQuery(myDBConnection, "select top 10 CustomerID from CustList 
order by CustomerID", stringsAsFactors = FALSE)
testDF = data.frame(CustomerID = character(), stringsAsFactors = FALSE)
for( i in 1:10 )

{
   testDF[i,1] = foo[i,1]
}
View(testDF)

However, 8 of 10 of the CustomerIDs come in unchanged, the two that are changed are altered by +1 at the end. I'm thinking it has something to do with the fact that the next one in the series is just one off from the prior, i.e. it will be xxxx100 then xxxx101 with xxxx100 printing twice. Note: each CustomerID is unique so this does not happen in the original data set and has been verified.

I would like to loop call my sqlQuery() rather than loading the data first into 'foo' but I can't figure out place an iterative variable in my sqlQuery function. If anybody knows, it would be greatly appreciated.

  • It's unsurprising that the data types would be different between 2 systems with fundamentally different data types. However I've never heard of R arbitrarily changing numbers and in all the years I've been doing it this never happened to me. Very peculiar... could you give a reproducible example using an `RODBC` query of a builtin dataset like in the `RODBC` vignette? – Hack-R Aug 22 '16 at 22:39
  • in your SQL data is the data value for customer `2724` `2.84...6575`? Or is it something else entirely? – Chris Aug 22 '16 at 22:43
  • Re: @Chris The CustomerID is x0101 (truncated) and the value is 2.84..(truncated)..7366 – How About It Aug 22 '16 at 22:48
  • Re: @Hack-R I'm new to R and not sure how to give you a builtin database to test – How About It Aug 22 '16 at 22:49
  • @HowAboutIt I think it's on page 4 of the vignette but I'm going by memory. I'll check. *Update* Yea, page 4 https://cran.r-project.org/web/packages/RODBC/vignettes/RODBC.pdf You could also just make up your own sample data tho. – Hack-R Aug 22 '16 at 22:50
  • 1
    What is the result of the query `SELECT DataValue FROM sql_table WHERE CustomerID = 'xxxxxxx2724'` when you run it in SQL (replacing sql_table with your table name, and the xxx's with the numbers you see in R). What I'm getting at is that if your ID and Value are different, its probably not that the id and value are wrong, but that you are pulling the wrong pair – Chris Aug 22 '16 at 23:02
  • @Chris nothing that CustomerID does not exist – How About It Aug 22 '16 at 23:05
  • I may have a theory (from painful personal experience) about what's going on. Two questions: how many digits are the ids? Are they always digits (but stored as varchar)? Never any letters? – joran Aug 22 '16 at 23:56
  • I'm a little busy with stuff at home and so I can't monitor this question, but try installing [this](https://github.com/joranE/RODBC) version of RODBC from github via `devtools::install_github` and see if that fixes the problem. I have had problems with RODBC _ignoring_ varchar declarations for db fields (Oracle) and converting the transferred data via `type.convert()`. – joran Aug 23 '16 at 00:03

2 Answers2

0

If your version of SQL Server is 2016 you can use R functionalities in SQL Server itself, and it will not give this conversion errors

Kannan Kandasamy
  • 13,405
  • 3
  • 25
  • 38
0

In the scenario you describe, it sounds like you are simply getting 2 different records returned from SQL (each with their own unique CustomerID values) because you are not explicitly ordering or filtering your data.

Here is a Stack Overflow discussion on this topic

Your test sql query:

"select top 10 CustomerID from CustList"

is not guaranteed to get the same "top 10" records from SQL without an explicit WHERE or ORDER BY statement.

Adding an ORDER BY statement to the end of your SQL query will ensure that you get the same order of records returned every time.

"select top 10 CustomerID from CustList order by CustomerID"
Community
  • 1
  • 1
weskimmo
  • 126
  • 3