
I need to feed an external MSSQL server with a large amount of data calculated in R.

  • No direct access to the DB is possible, so the data has to go through an interim export file.
  • Excel format cannot be used, because the number of data rows exceeds Excel's capacity.
  • CSV would be fine, but the data itself poses many obstacles: semicolons used in names, special characters, unclosed quotations (an odd number of "), and so on.

I am looking for a versatile method of transporting data from R to an MSSQL database, independent of the data content. If I could save a data frame as a single-table database in an MSSQL backup-format file, that would satisfy my needs.

Any idea on how to achieve this? Any package available? Any suggestion would be appreciated.

cineS.
  • Have you considered exporting your data to XML or JSON? – John Cappelletti Jun 19 '20 at 14:44
  • CSV handles semi-colons and embedded quotes just fine. I'm inferring that you need to bulk-insert your data; are you using `sqlcmd` or `bcp` for that? Both *do* have significant problems (embedded quotes in "normal" CSV), but if you have embedded commas then you can choose a different field-separator (other than comma). – r2evans Jun 19 '20 at 14:47

1 Answer


I'm inferring you're hoping to bulk-insert the data using bcp or sqlcmd. While neither deals well with embedded commas and embedded quotes, you can work around this by using a different field separator (one that is not contained within the data).

Setup:

evil_data <- data.frame(
  id = 1:2,
  chr = c('evil"string ,;\n\'', '",";:|"'),
  stringsAsFactors = FALSE
)
# con <- DBI::dbConnect(...)
DBI::dbExecute(con, "create table r2test (id INT, chr nvarchar(64))")
# [1] 0
DBI::dbWriteTable(con, "r2test", evil_data, create = FALSE, append = TRUE)
DBI::dbGetQuery(con, "select * from r2test")
#   id               chr
# 1  1 evil"string ,;\n'
# 2  2           ",";:|"

First, I'll use \U001 as the field separator and \U002 as the row separator. Those two should be "good enough", but if you have non-printable characters in your data, then you might either change your separators to other values or look for encoding options for the data (e.g., base64, though it might need to be stored that way).

write.table(evil_data, "evil_data.tab", sep = "\U001", eol = "\U002", quote = FALSE)
# or data.table::fwrite or readr::write_delim
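
If even those control characters could appear in your data, the base64 option mentioned above is one way out. A minimal sketch, assuming the base64enc package is available (the encoded column would stay base64-encoded in SQL Server unless you decode it there):

# hedged sketch: base64-encode the character column so the export file cannot
# contain raw separator characters; decode (or keep encoded) on the SQL side
enc <- evil_data
enc$chr <- vapply(enc$chr,
                  function(x) base64enc::base64encode(charToRaw(x)),
                  character(1))
write.table(enc, "evil_data_b64.tab", sep = "\U001", eol = "\U002", quote = FALSE)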

Since I'm using bcp, it can use a "format file" to indicate separators and which columns in the source correspond to which columns in the database. See the references for how to create this file, but for this example I'll use:

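# non-XML format file layout: version, field count, then one line per field with
# host field order, host data type, prefix length, host data length, terminator,
# server column order, server column name, collation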
fmt <- c("12.0", "2",
         "1       SQLCHAR             0       0       \"\001\"    1     id                                                               \"\"                          ", 
         "2       SQLCHAR             0       0       \"\002\"    2     chr                                                              SQL_Latin1_General_CP1_CI_AS")
writeLines(fmt, "evil_data.fmt")

From here, assuming bcp is in your PATH (otherwise you'll need an absolute path to bcp), run this in a terminal (I'm using git-bash on Windows, but this should be the same in others). The first line is your part; the second line is specific to my database connection, so you'll need to omit or change it for your own connection.

$ bcp [db_owner].[r2test] in evil_data.tab -F2 -f evil_data.fmt -r '\002' \
  -S '127.0.0.1,21433' -U 'myusername' -d 'mydatabase' -P ***MYPASS***

Starting copy...

2 rows copied.
Network packet size (bytes): 4096
Clock Time (ms.) Total     : 235    Average : (8.51 rows per sec.)
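
If you'd rather launch bcp from within R instead of a separate terminal, a call through system2() along these lines should be equivalent (the server, user, database, and password are the same placeholders as in the command above):

# hedged sketch: run the same bcp command from R; quote values with shQuote()
# if your shell requires it, and adjust the connection arguments for your server
args <- c("[db_owner].[r2test]", "in", "evil_data.tab",
          "-F2", "-f", "evil_data.fmt", "-r", "\\002",
          "-S", "127.0.0.1,21433", "-U", "myusername",
          "-d", "mydatabase", "-P", "***MYPASS***")
system2("bcp", args)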

Proof that it worked:

DBI::dbGetQuery(con, "select * from r2test")
#   id                      chr
# 1  1        evil"string ,;\n'
# 2  2                  ",";:|"
# 3  1 1\001evil"string ,;\r\n'
# 4  2             2\001",";:|"

References:

r2evans