Problem

I have an XML file that I would like to parse in R. I know that this file is not corrupted because the following Python code seems to work:

>>> import xml.etree.ElementTree as ET
>>> xml_tree = ET.parse(PATH_TO_MY_XML_FILE)
>>> do_my_regular_xml_stuff_that_seems_to_work_no_problem(xml_tree)

Now, when I try to run the following code in R, I get an error message:

> library("XML")
> xml_tree <- XML::xmlParse(PATH_TO_MY_XML_FILE)

Error in nchar(text_repr): invalid multibyte string, element 1
Traceback:


Alright, maybe the parser doesn't recognize the encoding. Luckily this should be specified in a decent XML file. So, I go to my shell and check:

$ head -n1 PATH_TO_MY_XML_FILE

??<?xml version="1.0" encoding="utf-16"?>

Now I can go back to R and pass the encoding explicitly, only to hit the next error message, which is where I am stuck:

> library("XML")
> xml_tree <- XML::xmlParse(PATH_TO_MY_XML_FILE, encoding='UTF-16')

Start tag expected, '<' not found
Error: 1: Start tag expected, '<' not found

Traceback:

1. XML::xmlParse(filePath, encoding = "UTF-16")
2. (function (msg, ...) 
 . {
 .     if (length(grep("\\\n$", msg)) == 0) 
 .         paste(msg, "\n", sep = "")
 .     if (immediate) 
 .         cat(msg)
 .     if (length(msg) == 0) {
 .         e = simpleError(paste(1:length(messages), messages, sep = ": ", 
 .             collapse = ""))
 .         class(e) = c(class, class(e))
 .         stop(e)
 .     }
 .     messages <<- c(messages, msg)
 . })(character(0))

A last attempt to check (in R) if the file is in fact "UTF-16" encoded yields:

> f <- file(PATH_TO_MY_XML_FILE, 'r', encoding = "UTF-16")
> firstLine <- readLines(f, n=1)
> close(f)
> print(firstLine)

[1] "<?xml version=\"1.0\" encoding=\"utf-16\"?>"

Which looks just about right to me.


Question(s)

Does anyone know what is happening here? Is this a bug from the XML library? Is the file maybe not 'UTF-16' encoded, even though it claims it is? What are the two question marks ?? that I see when I print the file into the shell? These question marks don't appear when reading in the file properly...

chickenNinja123
  • Please open XML in a text editor and post a sample of its content in body of question. Well-formed XML files cannot have *any* character or entity (visible or not) before the header declaration. – Parfait Oct 22 '20 at 23:23
  • If I open the file with Atom, the default display of the first line reads `��`. If I select a UTF-16 encoding, the first line reads ``. A hex dump in my shell (e.g. `$ xxd PATH_TO_MY_FILE`) of the first few characters of the file yields: `fffe 3c00 3f00 7800 6d00 6c00 2000 7600` – chickenNinja123 Oct 23 '20 at 07:58
  • Try to adjust your R IDE or session with UTF-16 encoding. – Parfait Oct 23 '20 at 18:13

1 Answer

Is this a bug from the XML library?

I think there could be a bug here. If I generate a valid UTF-16 XML document, which will have an initial byte-order mark:

$ echo '<a>😊</a>' | iconv -t utf-16 >a-utf16.xml
$ xxd a-utf16.xml 
00000000: fffe 3c00 6100 3e00 3dd8 0ade 3c00 2f00  ..<.a.>.=...<./.
00000010: 6100 3e00 0a00                           a.>...

then I can parse it with:

> XML::xmlParse('a-utf16.xml')
<?xml version="1.0"?>
<a>&#x1F60A;</a>

but not if I specify the encoding:

> XML::xmlParse('a-utf16.xml', encoding='utf-16')
Start tag expected, '<' not found
Error: 1: Start tag expected, '<' not found
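
The leading `fffe` in the hex dump above (and in the one you posted in the comments) is the UTF-16 little-endian byte-order mark, which is most likely what your shell prints as `??`. As a rough sketch, here is one way to check which BOM a file starts with in Python. The names `sniff_bom` and `sniff_file` are made up for this illustration, and a file without a BOM simply yields `None`:

```python
import codecs

# Known byte-order marks. The UTF-32 marks come first because the
# UTF-32-LE mark (ff fe 00 00) begins with the UTF-16-LE mark (ff fe).
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(head: bytes):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if head.startswith(bom):
            return name
    return None


def sniff_file(path):
    """Read the first four bytes of a file and report the BOM, if any."""
    with open(path, "rb") as f:
        return sniff_bom(f.read(4))


# First bytes from the hex dump posted in the comments: ff fe 3c 00 3f 00
print(sniff_bom(bytes.fromhex("fffe3c003f00")))  # utf-16-le
```

This only looks at the first few bytes, so it says nothing about whether the rest of the file is valid.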

Your original problem was when you weren't specifying the encoding. However:

I know that this file is not corrupted because the following Python code seems to work

That's a good hint, but I think you'll find edge cases where that doesn't hold. Try iconv for a second opinion on whether the file is encoded correctly.
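
If iconv is not at hand, a strict decode in Python gives a similar second opinion. This is only a sketch: the helper names `check_utf16_bytes` and `check_utf16_file` are invented here, and passing this check shows that the bytes are valid UTF-16, not that the XML is well-formed:

```python
def check_utf16_bytes(data: bytes):
    """Strictly decode bytes as UTF-16 and return (ok, detail)."""
    try:
        # The "utf-16" codec honours a leading BOM and raises on
        # malformed input (odd byte count, unpaired surrogates, ...).
        text = data.decode("utf-16")
    except UnicodeDecodeError as exc:
        return False, f"invalid UTF-16 at byte {exc.start}: {exc.reason}"
    return True, f"decoded {len(text)} characters"


def check_utf16_file(path):
    """Read the whole file in binary mode and run the strict check on it."""
    with open(path, "rb") as f:
        return check_utf16_bytes(f.read())
```

Calling `check_utf16_file(PATH_TO_MY_XML_FILE)` returns `(False, ...)` with the byte offset of the first problem if the file is not valid UTF-16.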

For a more specific response, you'll need to post a reproducible XML file.

Joe