I have large XML files that I want to turn into dataframes for further processing within R and other programs. This is all being done on macOS.
Each monthly XML is around 1 GB, with about 150k records and 191 different variables. In the end I might not need all 191 variables, but I'd like to keep them and decide later.
The XML files can be accessed here (scroll to the bottom for the monthly zips; once uncompressed, look at the "dming" XMLs).
I've made some progress, but processing the larger files takes too long (see below).
The XML looks like this:
<ROOT>
  <ROWSET_DUASDIA>
    <ROW_DUASDIA NUM="1">
      <variable1>value</variable1>
      ...
      <variable191>value</variable191>
    </ROW_DUASDIA>
    ...
    <ROW_DUASDIA NUM="150236">
      <variable1>value</variable1>
      ...
      <variable191>value</variable191>
    </ROW_DUASDIA>
  </ROWSET_DUASDIA>
</ROOT>
I hope that's clear enough. This is my first time working with XML.
I've looked at many answers here and in fact managed to get the data into a dataframe using a smaller sample (a daily XML instead of a monthly one) and xml2. Here's what I did:
library(xml2)

raw <- read_xml(filename)

# Find all records
dua <- xml_find_all(raw, "//ROW_DUASDIA")

# Create an empty dataframe sized to the number of records and variables
dualen <- length(dua)
varlen <- length(xml_children(dua[[1]]))
df <- data.frame(matrix(NA, nrow = dualen, ncol = varlen))

# Fill one row per record
for (j in seq_len(dualen)) {
  df[j, ] <- xml_text(xml_children(dua[[j]]), trim = TRUE)
}

# Name columns after the children of the first record
colnames(df) <- xml_name(xml_children(dua[[1]]))
I imagine that's fairly rudimentary but I'm also pretty new to R.
Anyway, this works fine with daily data (4-5k records), but it's probably too inefficient for 150k records; in fact, I waited a couple of hours and it hadn't finished. Granted, I would only need to run this code once a month, but I would like to improve it nonetheless.
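One direction I've been considering (only sketched, not tested beyond the daily files) is replacing the per-record loop with a single vectorised extraction and a matrix reshape. The big assumption here is that every record contains all 191 variables in the same order; if any are missing, values would silently shift between columns:

library(xml2)

raw <- read_xml(filename)
dua <- xml_find_all(raw, "//ROW_DUASDIA")

# Extract the text of every child of every record in one pass
# (assumes each record has all 191 children, in the same order)
values <- xml_text(xml_find_all(raw, "//ROW_DUASDIA/*"), trim = TRUE)

# Reshape row-wise into a records x variables dataframe
varnames <- xml_name(xml_children(dua[[1]]))
df <- as.data.frame(
  matrix(values, ncol = length(varnames), byrow = TRUE),
  stringsAsFactors = FALSE
)
colnames(df) <- varnames

I haven't verified the same-children assumption against the monthly files, so I don't know if this is safe.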
I also tried turning the elements of all records into a list using xml2's as_list function, so I could continue with plyr, but this also took too long.
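Roughly, that attempt followed this pattern (a sketch of the idea, not my exact code):

library(xml2)
library(plyr)

# Convert every record into a nested list, then flatten each one
# into a one-row dataframe and stack them -- correct, but slow at 150k records
recs <- as_list(xml_find_all(raw, "//ROW_DUASDIA"))
df <- ldply(recs, function(r) as.data.frame(t(unlist(r)), stringsAsFactors = FALSE))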
Thanks in advance.