
Is there a way to locate an encoding problem within an XML file? I'm trying to parse such a file (let's call it doc) with the XML library in R, but there seems to be a problem with the encoding.

library(XML)
xmlInternalTreeParse(doc, asText=TRUE)
Error: Document labelled UTF-16 but has UTF-8 content.
Error: Input is not proper UTF-8, indicate encoding!
Error: Premature end of data in tag ...

and a list of tags that supposedly end prematurely follows. However, I'm pretty sure that there are no premature ends in this document.
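A first sanity check might be to look at the raw bytes at the start of the document, since a genuine UTF-16 document normally begins with a byte-order mark (ff fe or fe ff) and stores ASCII characters as two bytes each. A minimal sketch, assuming doc is already read into R as a single character string (if it is a file path, readBin can be used on the file instead):

head(charToRaw(doc), 20)                      # look for a BOM and for 00 bytes typical of UTF-16
# readBin("file.xml", what = "raw", n = 20)   # same check directly on a file ("file.xml" is a placeholder)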

Ok, so next try:

doc <- iconv(doc, to="UTF-8")
doc <- sub("utf-16", "utf-8", doc)
xmlInternalTreeParse(doc, asText=TRUE)
Error: Premature end of data in tag...

and again a list of tags follows along with line numbers. I've checked the lines and I can't find any errors.
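(As pointed out in the comments below, iconv() usually needs the source encoding as well, otherwise it has to guess. A sketch of a more explicit conversion; the from = "latin1" here is only an assumption, not something I know about this document:)

doc <- iconv(doc, from = "latin1", to = "UTF-8")   # state the source encoding explicitly
Encoding(doc)                                      # check how R marks the string afterwards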

Another suspicion: the "µ" character that occurs in the document might be causing the error. So next try:

doc <- iconv(doc, to="UTF-8")
doc <- gsub("µ", "micro", doc)
doc <- sub("utf-16", "utf-8", doc)
xmlInternalTreeParse(doc, asText=TRUE)
Error: Premature end of data in tag...
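Another way to hunt for the offending characters might be to scan the string line by line for bytes that aren't valid UTF-8. A rough sketch (validUTF8() exists only in R >= 3.3.0; tools::showNonASCII() is an older alternative):

lines <- strsplit(doc, "\n", fixed = TRUE, useBytes = TRUE)[[1]]
which(!validUTF8(lines))      # indices of lines containing invalid UTF-8 bytes (R >= 3.3.0)
tools::showNonASCII(lines)    # prints lines containing non-ASCII characters, with the bytes marked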

Any other suggestions for debugging?

EDIT: After spending two days trying to fix the error, I still haven't found a solution. However, I think I have narrowed down the possible causes. Here is what I've found:

  • copying the XML string from the source database into a file and saving it as a separate XML file in Notepad++ --> Document labelled UTF-16 but has UTF-8 content.

  • changing <?xml version="1.0" encoding="utf-16"?> to <?xml version="1.0" encoding="utf-8"?> (or encoding="latin1") within this file --> no error

  • reading XML string from database via doc <- sqlQuery(myconn, query.text, stringsAsFactors = FALSE); doc <- doc[1,1], manipulating it with str_sub(doc, 35, 36) <- "8" or str_sub(doc, 31, 36) <- "latin1" and then trying to parse it with xmlInternalTreeParse(doc) --> Premature end of data in tag...

  • reading the XML string from database as above and then trying to parse it with xmlInternalTreeParse(doc) --> Document labelled UTF-16 but has UTF-8 content. Input is not proper UTF-8, indicate encoding ! Bytes: 0xE4 0x64 0x2E 0x20 Premature end of data in tag... (list of tags follows).

  • reading the XML string from database as above and parsing with xmlInternalTreeParse(doc, encoding="latin1") --> Premature end of data in tag...

  • using doc <- iconv(doc[1,1], to="UTF-8") or to="latin1" before parsing doesn't change anything

I would appreciate any suggestions very much.

AnjaM
  • It's extremely hard to answer a question of this nature without a reproducible example – hadley Nov 21 '12 at 22:53
  • 1
    @hadley I don't have any idea how to provide an MWE here. When I change "UTF-16" to "UTF-8" in the document header and then copy the content of this file into another empty file and save it in exactly the same way, the `Document labelled UTF-16...` error disappears. Changing the header in the original file and saving the changes doesn't help. But I can't use this procedure every time as I need this script to automatically process data from a database. I'm puzzled and don't know how to debug or even how to provide an example as it seems not to be the content itself that causes the problem. – AnjaM Nov 22 '12 at 07:37
  • I know I debugged a similar problem a couple of months ago, but I can't remember exactly what I did. One other thing you can experiment with is to load the xml with `xmlInternalTreeParse(file(doc, encoding = "utf-16"))` and see if setting the encoding there helps. – hadley Nov 22 '12 at 13:32
  • @hadley Thanks for your suggestion. I've edited my first posting and listed the things I've tried so far. I did try to specify the encoding, but this doesn't help. Oddly enough, replacing `utf-16` with `utf-8` or `latin1` within the saved file in Notepad++ solves the problem. But doing the same by string manipulation after having imported the XML string from the SQL database into an `R` object doesn't help. – AnjaM Nov 22 '12 at 13:47
  • FYI your `iconv` call is unlikely to be correct - you usually need to specify both from and to. – hadley Nov 22 '12 at 17:28
  • @hadley I found out that this way it works for short XML files (even without specifying the `from` argument), and after taking a closer look at the file that is not working, I found that the file apparently gets chopped off. So it seems there were two problems: retrieval and encoding. The one with the encoding is solved now. I opened a new question on how to retrieve large strings, as I think it's not within the scope of this question: http://stackoverflow.com/questions/13525539/how-to-retrieve-a-very-long-xml-string-from-an-sql-database-with-r Anyway, thanks a lot for your suggestions! – AnjaM Nov 23 '12 at 08:50
  • @AnjaM you might answer your own question now that you have the answer to at least part of it. I found your `doc <- gsub("utf-16", "utf-8", doc); doc <- xmlInternalTreeParse(doc, asText=T)` snippet useful in parsing output from the Coding Analysis Toolkit http://cat.ucsur.pitt.edu/, but almost left the page when I saw your question didn't have an answer. Thanks. – Solomon Nov 30 '12 at 05:29

1 Answer


The encoding problem occurred because the encoding of the original XML file didn't match the encoding of the SQL database column where the XML content was stored as longtext. Replacing the encoding declaration inside the XML string and converting the string itself solved the problem:

library(RODBC)   # sqlQuery()
library(XML)     # xmlInternalTreeParse()

doc <- sqlQuery(myconn, query.text, stringsAsFactors = FALSE)   # fetch the XML string from the database
doc <- iconv(doc[1, 1], to = "UTF-8")                           # convert the string itself to UTF-8
doc <- sub("utf-16", "utf-8", doc)                              # make the XML declaration match
doc <- xmlInternalTreeParse(doc, asText = TRUE)
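A couple of quick checks on the parsed document can confirm that it came through complete; this is only a sketch, since the expected root name and number of children are document-specific:

xmlName(xmlRoot(doc))   # name of the root element
xmlSize(xmlRoot(doc))   # number of top-level child nodes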

Truncation of the XML string during retrieval from the database turned out to be a separate problem. The solution is provided here: How to retrieve a very long XML-string from an SQL database with R?
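For what it's worth, truncation of that kind can be spotted before parsing by looking at the tail of the retrieved string. A minimal sketch to run on the XML string before it is parsed; "myroot" is just a placeholder for the document's actual root element:

cat(substr(doc, max(1, nchar(doc) - 80), nchar(doc)), "\n")   # eyeball the end of the string
grepl("</myroot>\\s*$", doc)                                  # TRUE only if the closing root tag is present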

AnjaM