
I have a script to get an XML file from an SQL database. Here is how I do this:

library(RODBC)
library(XML)

myconn <- odbcConnect("mydsn")

query.text <- "SELECT xmlfield FROM db WHERE id = 12345"
doc <- sqlQuery(myconn, query.text, stringsAsFactors=FALSE)
doc <- iconv(doc[1,1], from="latin1", to="UTF-8")
doc <- xmlInternalTreeParse(doc, encoding="UTF-8")

However, the parsing didn't work for one particular database row, although it worked when I copied the content of this field into a separate file and parsed from the file. After two days of trial and error I identified the main problem: querying short XML files this way causes no problems, but with larger files the string gets chopped off after 65534 characters. The end of the XML file is therefore missing and the file can't be parsed.
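A quick way to confirm the truncation for a suspect row (a hypothetical check built on the query above; `</root>` stands in for whatever the real document's closing root tag is):

```r
library(RODBC)

myconn <- odbcConnect("mydsn")
res <- sqlQuery(myconn, "SELECT xmlfield FROM db WHERE id = 12345",
                stringsAsFactors = FALSE)
doc <- res[1, 1]

# a value sitting right at / just below 65535 bytes is almost
# certainly truncated rather than genuinely that size
nchar(doc, type = "bytes")

# a complete XML document must end with its root element's closing tag;
# "</root>" is a placeholder for the real tag name
grepl("</root>", doc, fixed = TRUE)
```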

I thought this might be an overall restriction of the ODBC connections on my computer. However, another programme that also uses ODBC to get the same XML field from the same database does this without any problems. So I guess it's an R-specific problem.

Any ideas how to fix it?

AnjaM
  • I don't have an example to test, but did you try to convert the xmlfield with (cast/convert) or with the as.is option? – agstudy Nov 23 '12 at 12:57
  • @agstudy I tried to set `as.is = TRUE` within `sqlQuery`, but that didn't help. What do you mean by converting the xmlfield? – AnjaM Nov 23 '12 at 13:51
  • something like : select cast(xmlfield as varchar(255)).. – agstudy Nov 23 '12 at 14:10
  • @agstudy For some reason, `varchar` doesn't work, but I modified the statement as `SELECT CAST(xmlfield as CHAR(150000)) FROM...` and the `sqlQuery` worked this way. However, the problem persists that the content of the field is chopped off after 65534 characters. – AnjaM Nov 23 '12 at 14:57

2 Answers


I've written to the package author and have finally received the following answer:

Your inability to read is not my problem, nor is it a reasonable excuse.

The manual says

'\item[Character types] Character types can be classified three ways: fixed or variable length, by the maximum size and by the character
set used. The most commonly used types\footnote{the SQL names for
these are \code{CHARACTER VARYING} and \code{CHARACTER}, but these
are too cumbersome for routine use.} are \code{varchar} for short
strings of variable length (up to some maximum) and \code{char} for
short strings of fixed length (usually right-padded with spaces).
The value of `short' differs by DBMS and is at least 254, often a
few thousand---often other types will be available for longer
character strings. There is a sanity check which will allow only
strings of up to 65535 bytes when reading: this can be removed by
recompiling \pkg{RODBC}.'

This manual can be found in the doc directory of the RODBC package. This information is not contained within the reference manual.

As I have in the meantime found a good solution to retrieve my data without using RODBC, I haven't tried recompiling the package. But I hope this answer will be helpful for those running into the same issue.
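For those who want to stay with RODBC, one workaround (a sketch, assuming a SQL Server-style backend with `SUBSTRING` and `VARCHAR(MAX)`, and the table/column names from the question) is to fetch the field in chunks below the 65535-byte cap and reassemble it in R:

```r
library(RODBC)
library(XML)

myconn <- odbcConnect("mydsn")

chunk.size <- 60000L   # safely below RODBC's 65535-byte limit
chunks <- character(0)
offset <- 1L
repeat {
  q <- sprintf(
    "SELECT SUBSTRING(CAST(xmlfield AS VARCHAR(MAX)), %d, %d)
     FROM db WHERE id = 12345",
    offset, chunk.size)
  piece <- sqlQuery(myconn, q, stringsAsFactors = FALSE)[1, 1]
  if (is.na(piece) || nchar(piece) == 0L) break
  chunks <- c(chunks, piece)
  offset <- offset + chunk.size
}

doc <- xmlInternalTreeParse(paste(chunks, collapse = ""), encoding = "UTF-8")
```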

AnjaM

If you want to change the source of RODBC and recompile, it is fairly easy using GitHub and the devtools package:

  1. fork the repo here: https://github.com/cran/RODBC
  2. comment out this line (from the R-3.0.3 branch): https://github.com/cran/RODBC/blob/R-3.0.3/src/RODBC.c#L734

            if (datalen > 65535) datalen = 65535;
    
  3. (re)install from github:

    devtools::install_github("<yourgithubname>/RODBC")
    

Now you should be able to read in large strings. Something to note, though: you may get errors from trying to allocate too much memory, since the line following the sanity check is:

    thisHandle->ColData[i].pData = Calloc(nRows * (datalen + 1), char);

hence the simplest way to proceed is to set the argument `rows_at_time = 1` in your `sqlQuery` call from R.
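With the patched package installed, the query from the question would then look like this (untested sketch, reusing the DSN and table names from the question):

```r
library(RODBC)

myconn <- odbcConnect("mydsn")

# fetching one row at a time keeps the Calloc above at
# 1 * (datalen + 1) bytes instead of nRows * (datalen + 1)
doc <- sqlQuery(myconn,
                "SELECT xmlfield FROM db WHERE id = 12345",
                stringsAsFactors = FALSE,
                rows_at_time = 1)
```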

HTH