I am retrieving online XML data using the XML and RCurl R packages. My issue is that the UTF-8 encoding is lost during the call to xmlToList: for instance, 'é' is replaced by 'Ã©'. This happens during the XML parsing.

Here is a code snippet with one example where the encoding is lost and another where it is kept (depending on the data source):

library(XML)    # xmlToList
library(RCurl)  # getURL

# SDMX data query: accented characters come back garbled
url <- "http://www.bdm.insee.fr/series/sdmx/data/DEFAILLANCES-ENT-FR-ACT/M.AZ+BE.BRUT+CVS-CJO?lastNObservations=2"
res <- getURL(url)
xmlToList(res)
# encoding lost

# SDMX concept scheme query: accented characters are intact
url2 <- "http://www.bdm.insee.fr/series/sdmx/conceptscheme/"
res2 <- getURL(url2)
xmlToList(res2)
# encoding kept

Why does the encoding behave differently between the two sources? I tried setting .encoding = "UTF-8" in getURL and calling enc2utf8(res), but neither makes any difference.

Any help is welcome!

Thanks,

Jérémy

R version 3.2.1 (2015-06-18)
Platform: i386-w64-mingw32/i386 (32-bit)
Running under: Windows 7 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=French_France.1252  LC_CTYPE=French_France.1252   
[3] LC_MONETARY=French_France.1252 LC_NUMERIC=C                  
[5] LC_TIME=French_France.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] RCurl_1.95-4.7 bitops_1.0-6   XML_3.98-1.3  

loaded via a namespace (and not attached):
[1] tools_3.2.1
jlesuffleur

1 Answer

You are trying to read SDMX documents in R. I would suggest using the rsdmx package, which makes reading SDMX documents easier. The package is available on CRAN; you can also get the latest version on Github.

rsdmx allows you to read SDMX documents from a file or a URL, e.g.:

library(rsdmx)
# read the SDMX document directly from its URL, then flatten it to a data frame
sdmx <- readSDMX("http://www.bdm.insee.fr/series/sdmx/data/DEFAILLANCES-ENT-FR-ACT/M.AZ+BE.BRUT+CVS-CJO?lastNObservations=2")
as.data.frame(sdmx)

Another approach is to use the web-service interface to the data providers embedded in the package; INSEE is one of them. Try:

sdmx <- readSDMX(providerId = "INSEE", resource = "data",
                 flowRef = "DEFAILLANCES-ENT-FR-ACT",
                 key = "M.AZ+BE.BRUT+CVS-CJO", key.mode = "SDMX",
                 start = 2010, end = 2015)
as.data.frame(sdmx)

AFAIK the package also has issues with character encoding, but I'm currently investigating a solution that should be available in the package soon. Calling getURL(file, .encoding="UTF-8") retrieves the data properly, but the encoding is lost when calling the XML functions.
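One thing that may be worth trying on the XML side is to state the encoding explicitly when parsing, rather than letting xmlToList parse the raw string itself. The sketch below is only an illustration under assumptions: the small XML string is a made-up stand-in for the text returned by getURL (so no network access is needed), and whether this resolves the INSEE case has not been verified here. It relies on the `encoding` argument of `xmlParse` in the XML package, which is meant for documents that do not declare their own encoding.

```r
library(XML)

# Hypothetical stand-in for the string returned by getURL():
# UTF-8 text with no encoding declaration in the document itself.
res <- enc2utf8("<root><label>D\u00e9faillances</label></root>")

# Tell the parser which encoding the text is in, instead of
# leaving it to guess (e.g. from a Windows-1252 locale).
doc <- xmlParse(res, asText = TRUE, encoding = "UTF-8")

# xmlToList accepts a parsed document as well as a character string.
lst <- xmlToList(doc)
print(lst$label)
```

With the real data, `res` would simply be the result of `getURL(url, .encoding = "UTF-8")`.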

Note: I also see that you use the lastNObservations parameter. For the moment the web-service interface does not support extra parameters, but support could be added quite easily if you need it.

eblondel
  • Thanks for pointing me to this package! It seems very convenient, I will have a deeper look at it. Keep me updated if you find a solution for the encoding problem. – jlesuffleur Nov 03 '15 at 16:47
  • @jlesuffleur Sorry for the late reply. Since rsdmx version 0.5-2, I've put a mechanism in place to make sure the proper encoding is set. Note that rsdmx wraps the SDMX document in dedicated classes, and at that level I haven't (yet) found a way to properly encode textual content; hence the encoding fix is applied when you call an ``as.data.frame`` function on an R ``SDMX*`` object. Do not hesitate to follow rsdmx on Github and to report bugs or suggest improvements. – eblondel Mar 21 '16 at 11:17
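For readers who end up with already-garbled text in a data frame, a fix applied after the fact can be sketched in base R. This is not the rsdmx implementation (which is not shown in this thread); `fix_mojibake` is a hypothetical helper, and it assumes the damage comes from UTF-8 bytes having been read as latin1. Windows-1252 differs from latin1 for a few characters, so this is an approximation.

```r
# Simulate the damage: the UTF-8 bytes of "é" are read as if they were latin1.
original <- "D\u00e9faillances"
garbled <- iconv(original, from = "latin1", to = "UTF-8")

# Hypothetical helper: reverse the mix-up for a character vector.
fix_mojibake <- function(x) {
  # Convert back to the single bytes the text originally consisted of ...
  out <- iconv(x, from = "UTF-8", to = "latin1")
  # ... and declare what those bytes really are.
  # Strings containing characters outside latin1 come back as NA.
  Encoding(out) <- "UTF-8"
  out
}

# Apply the helper to every character column of a data frame.
df <- data.frame(label = garbled, value = 42, stringsAsFactors = FALSE)
is_chr <- vapply(df, is.character, logical(1))
df[is_chr] <- lapply(df[is_chr], fix_mojibake)
print(df)
```

Applying such a repair to text that was never garbled can corrupt it, so it should only be used on columns known to be affected.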