6

What is the fastest way to convert XML files to data frames in R currently?

The XML looks like this: (Note- not all rows have all fields)

  <row>
    <ID>001</ID>
    <age>50</age>
    <field3>blah</field3>
    <field4 />
  </row>
  <row>
    <ID>001</ID>
    <age>50</age>
    <field4 />
  </row>

I have tried two approaches:

  1. The xmlToDataFrame function from the XML library
  2. The speed oriented xmlToDF function posted here

For an 8.5 MB file, with 1.6k "rows" and 114 "columns", xmlToDataFrame took 25.1 seconds, while xmlToDF took 16.7 seconds on my machine.

These times are quite large, when compared with python XML parsers (eg. xml.etree.ElementTree) which was able to do the job in 0.4 seconds.

Is there a faster way to do this in R, or is there something fundamental in R that prevents us making this faster?

Some light on this would be really helpful!

user997943
  • 303
  • 1
  • 5
  • 12
  • 1
    i think that `xml.etree.ElementTree` in python only do parsing, but not turning it into table....am i correct? – Randy Lai Apr 06 '14 at 08:34
  • 0.4 seconds is the total time taken starting with the XML file and writing a table in csv format – user997943 Apr 06 '14 at 15:09
  • If you look at the the function xmlToDF,` xpathSApply(doc, paste0(xpath, "/", x), xmlValue)` is actually doing multiple searches on the fields. I guess you can make this more efficient by reading all the fields simultaneously. – Randy Lai Apr 06 '14 at 23:27

2 Answers2

4

Updated for the comments

d = xmlRoot(doc)
size = xmlSize(d)

names = NULL
for(i in 1:size){
    v = getChildrenStrings(d[[i]])
    names = unique(c(names, names(v)))
}

for(i in 1:size){
    v = getChildrenStrings(d[[i]])
    cat(paste(v[names], collapse=","), "\n", file="a.csv", append=TRUE)
}

This finishes in about 0.4 second for a 1000x100 xml record. If you know the variable name, you can even omit the first for loop.

Note: if you xml content contains commas, quotation marks, you may have to take special care about them. In this case, I recommend the next method.


if you want to construct the data.frame dynamically, you can do this with data.table, data.table is a little bit slower than the above csv method, but faster than data.frame

m = data.table(matrix(NA,nc=length(names), nr=size))
setnames(m, names)
for (n in names) mode(m[[n]]) = "character"
for(i in 1:size){
    v = getChildrenStrings(d[[i]])
    m[i, names(v):= as.list(v), with=FALSE]
}
for (n in names) m[, n:= type.convert(m[[n]], as.is=TRUE), with=FALSE]

It finishes in about 1.1 second for the same document.

Randy Lai
  • 3,084
  • 2
  • 22
  • 23
  • getChildrenStrings does work much faster. However, it returns a list of named vectors. Converting that list to a data.frame or data.table (using [rbind.fill.dt](http://stackoverflow.com/questions/18003717/is-there-any-efficient-way-than-rbind-filllist)) takes a very long time, making the whole process even longer than xmlToDF. (Remember that not all rows have all fields, and even field ordering is not guaranteed) – user997943 Apr 08 '14 at 19:58
  • you can first get the names of all the fields. Then, in the for loop, write the vector `v` into a `csv` file in the correct order. – Randy Lai Apr 09 '14 at 02:27
  • Thank you so much! The first variant is extremely fast (even if you add a read.table at the end to get a data.frame back). I'm now able to convert the same file in 0.436 s, which is quite comparable to python! – user997943 Apr 10 '14 at 13:59
  • i just learnt that data.table is much faster than data.frame. And note that the csv method would fail if the content of xml contains comma's. – Randy Lai Apr 10 '14 at 21:24
  • Hmm, data.table method looks good, will give it a try later (one would have to run type.convert on it). For the csv method, I'm using a very un-common separator (for my data), and am also removing the separator explicitly from the values before writing to the file – user997943 Apr 11 '14 at 18:24
  • ai..I did something wrong with the data.table (too new to data.table), the corrected version needs 1.1 sec... the csv method may be better.. – Randy Lai Apr 11 '14 at 22:03
3

Just in case it helps someone, I found this solution using data.table to be even faster in my use case, as it only converts data to data.table once is has finished looping over the rows:

library(XML)
library(data.table)

doc <- xmlParse(filename)
d <- getNodeSet(doc,  "//Data")
size <- xmlSize(d)

dt <- rbindlist(lapply(1:size, function(i) {
    as.list(getChildrenStrings(d[[i]]))
}))
martin
  • 103
  • 5