I have a lot of data in a MSSQL database. I've written a query to get me the data that I need and would like to load that data into R to analyze.
Problem: When loading the data into R, my CustomerID numbers are being altered, i.e. CustomerID in SQL server will display as "xxxx0101" [I've truncated the CustomerID field with xxxx to show it's only be altered near the end] but in R it shows up as "xxxx2724" (quotes are not to indicate strings: datatype in SQL is varchar(30); R appears to auto-assign the datatype). In addition it will do rounding on my DataValues (SQL: 2.84....7366 R: 2.84.....6575)
SQL Table structure:
CustomerID | DataValue
xxxxxxx0101 | 2.84...7366
In R the same info appears as:
CustomerID | DataValue
xxxxxxx2724 | 2.84...6575
And I don't understand why R is changing the values of my fields.
R Code:
require(RODBC)
myScript = paste(gsub("\t", " ", readLines("TestQuery.sql")), collapse = " ")
myData = data.frame(sqlQuery(myDBConnection, myCode))
View(myData)
The paste() and gsub() functions were added to the main readLines() to get rid of stuff like \t (for tab) and to store my multi-line SQL query into one string. I do not believe that is causing the problem, I've pasted the result of myScript directly back into MSSQL server and verified that it returns the exact same result as TestQuery.sql
I'm using the RODBC package to connect my R and SQL. Not sure if it's the right package to be using, if there's a newer package that doesn't have this probelm. If I'm running a small test query something like:
data.frame(sqlQuery(myDBConnection, "select top 10 CustomerID from CustList
order by CustomerID"))
I only have the problem with some of the CustomerIDs others come in correct.
Fun fact: my co-worker ran my SQL query, created a SAS data set from it, and then loaded the data set from SAS into R as a data frame and did not have the issues I've mentioned above (he used the Haven library in R to read in his SAS data set)
UPDATE: I attempted to first initialize an empty data frame of type character and then load in each CustomerID by one at a time using stringsAsFactors = FALSE. Code below:
foo = sqlQuery(myDBConnection, "select top 10 CustomerID from CustList
order by CustomerID", stringsAsFactors = FALSE)
testDF = data.frame(CustomerID = character(), stringsAsFactors = FALSE)
for( i in 1:10 )
{
testDF[i,1] = foo[i,1]
}
View(testDF)
However, 8 of 10 of the CustomerIDs come in unchanged, the two that are changed are altered by +1 at the end. I'm thinking it has something to do with the fact that the next one in the series is just one off from the prior, i.e. it will be xxxx100 then xxxx101 with xxxx100 printing twice. Note: each CustomerID is unique so this does not happen in the original data set and has been verified.
I would like to loop call my sqlQuery() rather than loading the data first into 'foo' but I can't figure out place an iterative variable in my sqlQuery function. If anybody knows, it would be greatly appreciated.