I am reading a UTF-8 encoded XML file in R using xmlParse and xpathSApply from Duncan Temple Lang's XML package. Text in several languages is not displayed correctly once I read it from the file into a data frame. I am currently on Windows, but this R script will be used on different machines, so I need a solution that works on all of them. See the sample XML file below:

<?xml version="1.0" encoding="UTF-8"?>
<CATALOG>
    <L1 lang="zh-TW">使用者識別碼</L1>
    <L2 lang="vi-VN">ID người dùng</L2>
</CATALOG>

When I print the data frame, the two values are displayed as <U+4F7F><U+7528><U+8005><U+8B58><U+5225><U+78BC> and ID nguo`i du`ng respectively, instead of the original text. Note that this is just a sample; the actual XML file has text in many different languages.

Code Snippet:

library(XML)
library(plyr)

getValues <- function(x) {
  List <- list()

  if(inherits(x, "XMLInternalElementNode")) {
    if(length(xmlValue(x, recursive=FALSE)) != 0) {
      List[[length(List)+1]] <- c(node = xmlName(x), value = xmlValue(x, recursive=FALSE))
    }
  }

  return(List)
}

visitNode <- function(node, xpath = "//node()") {
  if (is.null(node)) {
    return()
  }

  result <- xpathSApply(node, path = xpath, getValues)

  if(is.list(result)) {
    dt <<- rbind.fill(lapply(result,function(y){as.data.frame(do.call(rbind, y),stringsAsFactors=FALSE)}))
  }
} 


xtree <- xmlParse("C:/Users/I308232/Desktop/test.xml")
root <- xmlRoot(xtree)
dt <- data.frame(node = NA, value = NA)
visitNode(root)
dt

sessionInfo() output:

R version 3.1.2 (2014-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_Australia.1252        LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] RODBC_1.3-10 plyr_1.8.1   XML_3.98-1.1

loaded via a namespace (and not attached):
[1] Rcpp_0.11.3 tools_3.1.2

Any help will be appreciated. Thanks.

  • It's unclear to me exactly how you are extracting this. Can you make this sample [reproducible](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example)? Specifically, show how you are reading the XML in. What OS are you on? Are you sure the file itself is properly UTF-8 encoded? – MrFlick Dec 15 '14 at 03:41
  • I have updated my post with the code snippet. I am currently on Windows OS but this R script will be used on different machines so I need a solution that will be suitable for all. – user2877232 Dec 15 '14 at 04:13
  • Well, Windows typically defaults to Latin-1 encoding. Try adding an explicit `encoding="UTF-8"` to the `xmlParse()` call. (At least it worked when I tested on my Mac, R 3.1.0, XML 3.98-1.1.) – MrFlick Dec 15 '14 at 04:15
  • I put that in but no luck. I still get the same output. – user2877232 Dec 15 '14 at 04:19
  • What do you get for `Encoding(dt$value)`? – MrFlick Dec 15 '14 at 04:20
  • I get "UTF-8" for both values of the XML sample. – user2877232 Dec 15 '14 at 04:37
  • Then it sounds like the file may be improperly encoded. What do you get for `charToRaw(dt$value[2])`? I get `49 44 20 6e 67 c6 b0 c6 a1 cc 80 69 20 64 75 cc 80 6e 67`, which appears to be the correct UTF-8 encoding for the string. – MrFlick Dec 15 '14 at 04:39
  • Yes I get the same output as yours for charToRaw. – user2877232 Dec 15 '14 at 04:59
  • So you have the right bytes and the right encoding, but it's still not displaying correctly? What GUI are you using? Add the contents of `sessionInfo()` to your question. That will give all relevant version numbers and locale settings. – MrFlick Dec 15 '14 at 05:01
  • Could it be that R does not use a font that can display those chars? –  Dec 15 '14 at 11:42
  • Also, where did you get the original file? Did you create it yourself? What happens if you use R to write that content to a file and then read it? –  Dec 15 '14 at 11:43
  • I just tested this on a Windows machine and it seems to work just fine: `rr<-c("49","44","20","6e","67","c6","b0","c6","a1","cc","80","69","20","64","75","cc","80","6e","67");ss<-rawToChar(as.raw(strtoi(paste0("0x", rr))));Encoding(ss)<-"UTF-8";ss`. It returns "ID người dùng". The only real difference was that I was using R 3.1.1. Maybe it's a problem unique to 3.1.2. – MrFlick Dec 16 '14 at 19:20
  • Hey @MrFlick I used the code you gave above and the text is displayed correctly. I wonder what is the problem when displaying in a data frame. Is it an issue with the data frame itself? – user2877232 Dec 17 '14 at 00:24
  • Do you get the error/bad display when you do `data.frame(name=ss)` (continuing the example)? – MrFlick Dec 17 '14 at 00:30
  • @MrFlick No error but the display is as follows: name 1 ``ID nguo`i du`ng`` – user2877232 Dec 17 '14 at 00:49
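
The checks discussed in the comments above can be collected into one base-R sketch. It rebuilds the string from the reported bytes, puts it in a data frame, and confirms that the stored value still has the same bytes and encoding mark, which would suggest the data are intact and only the printed representation differs. It also tries the write-then-read round trip suggested in the comments. The names `bytes`, `df`, `tmp` and `back` are illustrative and not from the original post; the code makes no claim about what the console will display on any particular platform.

```r
# Bytes reported in the comments for "ID người dùng" (hex, UTF-8).
rr <- c("49", "44", "20", "6e", "67", "c6", "b0", "c6", "a1", "cc", "80",
        "69", "20", "64", "75", "cc", "80", "6e", "67")
bytes <- as.raw(strtoi(rr, base = 16L))

ss <- rawToChar(bytes)
Encoding(ss) <- "UTF-8"   # only marks the string; the bytes are unchanged

df <- data.frame(name = ss, stringsAsFactors = FALSE)

# The value stored in the data frame keeps its bytes and its encoding mark,
# whatever the console shows when the data frame is printed.
stored_ok <- identical(charToRaw(df$name[1]), bytes)
marked_ok <- Encoding(df$name[1]) == "UTF-8"

# Round trip through a file: write the raw UTF-8 bytes, then read them back
# and declare the encoding explicitly instead of relying on the locale.
tmp <- tempfile(fileext = ".txt")
con <- file(tmp, open = "wb")
writeLines(ss, con, useBytes = TRUE)
close(con)
back <- readLines(tmp, encoding = "UTF-8")
roundtrip_ok <- identical(charToRaw(back), bytes)

c(stored_ok = stored_ok, marked_ok = marked_ok, roundtrip_ok = roundtrip_ok)
```

If all three checks are TRUE on a given machine, the remaining difference is most likely in how that machine's locale renders the data frame when printing, not in the parsed values themselves.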

1 Answer

I do not know how to convert an Excel file to different languages, but I hope this helps you read in the file.

A UTF-8 encoded XML error like the one below can happen in RStudio when an Excel spreadsheet's file name and sheet name have different extensions (e.g. .csv, .xlsx, .txt).

library(readxl)  # provides read_xlsx()
beantraitData <- read_xlsx("traits_file_dev_sep.xlsx")
names(beantraitData)
# "X..xml.version.1.0.encoding.UTF.8.."

In this case, I got an error because the name of the file was traits_file_dev_sep.xlsx and the name of the sheet (which you can see in the tabs near the bottom of the page) was traits_file_dev_sep.csv.

This happened to me after I converted a .csv file to an .xlsx file in Lubuntu.
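
A column name like the one above looks like a mangled XML declaration, so it may help to check what the file actually contains, regardless of its extension. The following is a minimal base-R sketch under that assumption. `looks_like_xml` is a hypothetical helper (not part of readxl or base R), and the demo file is a temporary file created only for illustration. It does not handle a leading byte-order mark.

```r
# Hypothetical helper: TRUE when the first line of the file begins with an
# XML declaration, whatever the file extension says.
looks_like_xml <- function(path) {
  # warn = FALSE keeps readLines quiet about missing final newlines or
  # embedded NULs, which binary files such as real .xlsx workbooks contain.
  first <- readLines(path, n = 1L, warn = FALSE, encoding = "UTF-8")
  length(first) == 1L && grepl("^\\s*<\\?xml", first, useBytes = TRUE)
}

# Demo on a temporary file that is named .xlsx but contains XML text.
demo_xml <- tempfile(fileext = ".xlsx")
writeLines('<?xml version="1.0" encoding="UTF-8"?>', demo_xml)

# Demo on a temporary file that contains plain comma-separated text.
demo_csv <- tempfile(fileext = ".csv")
writeLines("id,name", demo_csv)

is_xml_1 <- looks_like_xml(demo_xml)
is_xml_2 <- looks_like_xml(demo_csv)
```

If the check returns TRUE for a file that was expected to be a spreadsheet, opening and re-saving it in the intended format is one thing to try before reading it again.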