I have large XML files that I want to turn into dataframes for further processing within R and other programs. This is all being done on macOS.

Each monthly XML is around 1 GB in size, with roughly 150k records and 191 different variables. In the end I might not need all 191 variables, but I'd like to keep them and decide later.

The XML files can be accessed here (scroll to the bottom for the monthly zips; once uncompressed, look at the "dming" XMLs).

I've made some progress, but processing the larger files takes too long (see below).

The XML looks like this:

<ROOT>
 <ROWSET_DUASDIA>
  <ROW_DUASDIA NUM="1">
   <variable1>value</variable1>
   ...
   <variable191>value</variable191>
  </ROW_DUASDIA>
  ...
  <ROW_DUASDIA NUM="150236">
   <variable1>value</variable1>
   ...
   <variable191>value</variable191>
  </ROW_DUASDIA>
 </ROWSET_DUASDIA>
</ROOT>

I hope that's clear enough. This is my first time working with XML.

I've looked at many answers here and in fact managed to get the data into a dataframe using a smaller sample (a daily XML instead of the monthly ones) and xml2. Here's what I did:

library(xml2)

raw <- read_xml(filename)

# Find all records
dua <- xml_find_all(raw, "//ROW_DUASDIA")

# Create empty dataframe
dualen <- length(dua)
varlen <- length(xml_children(dua[[1]]))
df <- data.frame(matrix(NA, nrow = dualen, ncol = varlen))

# For loop to enter the data for each record in each row
for (j in 1:dualen) {
  df[j, ] <- xml_text(xml_children(dua[[j]]), trim = TRUE)
}

# Name columns
colnames(df) <- names(as_list(dua[[1]]))

I imagine that's fairly rudimentary but I'm also pretty new to R.

Anyway, this works fine with daily data (4-5k records), but it's probably too inefficient for 150k records; in fact, I waited a couple of hours and it hadn't finished. Granted, I would only need to run this code once a month, but I would like to improve it nonetheless.

I tried to turn the elements for all records into a list using the as_list function within xml2 so I could continue with plyr, but this also took too long.

Thanks in advance.

GranDurismo

2 Answers

While there is no guarantee of better performance on larger XML files, the ("old school") XML package maintains a compact data frame handler, xmlToDataFrame, for flat XML files like yours. Any nodes missing from a record but present in its siblings result in NA for the corresponding fields.

library(XML)

doc <- xmlParse("/path/to/file.xml")
df <- xmlToDataFrame(doc, nodes=getNodeSet(doc, "//ROW_DUASDIA"))
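
To illustrate the NA handling, here is a minimal sketch on a hypothetical two-record snippet (toy values, not the real data): the second record omits variable2, and xmlToDataFrame fills the corresponding field with NA.

library(XML)

# Hypothetical toy snippet: the second record has no <variable2>
toy <- xmlParse('<ROOT><ROWSET_DUASDIA>
  <ROW_DUASDIA NUM="1"><variable1>a</variable1><variable2>b</variable2></ROW_DUASDIA>
  <ROW_DUASDIA NUM="2"><variable1>c</variable1></ROW_DUASDIA>
</ROWSET_DUASDIA></ROOT>')

xmlToDataFrame(toy, nodes = getNodeSet(toy, "//ROW_DUASDIA"))
#   variable1 variable2
# 1         a         b
# 2         c      <NA>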

You can even conceivably download the daily zips, unzip the needed XML, and parse it into a data frame, should the large monthly XMLs pose memory challenges. As an example, the code below extracts December 2018 daily data into a list of data frames to be row-bound at the end. The process even adds a DDate field. The method is wrapped in a tryCatch to handle missing days in the sequence or other URL or zip issues.

dec_urls <- paste0(1201:1231)
temp_zip <- "/path/to/temp.zip"
xml_folder <- "/path/to/xml/folder"

xml_process <- function(dt) {      
  tryCatch({
    # DOWNLOAD ZIP TO URL
    url <- paste0("ftp://ftp.aduanas.gub.uy/DUA%20Diarios%20XML/2018/dd2018", dt,".zip")
    file <- paste0(xml_folder, "/dding2018", dt, ".xml")

    download.file(url, temp_zip)
    unzip(temp_zip, files=paste0("dding2018", dt, ".xml"), exdir=xml_folder)
    unlink(temp_zip)           # DESTROY TEMP ZIP

    # PARSE XML TO DATA FRAME
    doc <- xmlParse(file)        
    df <- transform(xmlToDataFrame(doc, nodes=getNodeSet(doc, "//ROW_DUASDIA")),
                    DDate = as.Date(paste0("2018", dt), format="%Y%m%d", origin="1970-01-01"))
    unlink(file)               # DESTROY TEMP XML

    # RETURN XML DF
    return(df)
  }, error = function(e) NA)      
}

# BUILD LIST OF DATA FRAMES
dec_df_list <- lapply(dec_urls, xml_process)

# FILTER OUT NAs CAUGHT IN tryCatch (NROW(NA) is 1, so test for data frames instead)
dec_df_list <- Filter(is.data.frame, dec_df_list)

# ROW BIND TO FINAL SINGLE DATA FRAME
dec_final_df <- do.call(rbind, dec_df_list)
Parfait
  • I was going for a similar approach, processing daily XMLs and then binding them. Then I noticed that for years 2001-2017 the daily files were removed and only the monthlies were kept (I assume they delete all dailies once every monthly for the year is complete). What this means is that I'll have to find a way to deal with monthly XMLs. Everything I've tried with the `xml2` package takes several hours to complete on monthly XMLs with 150,000 records. Right now I'm trying your approach with `xmlToDataFrame`. If that doesn't work, I'll have to start thinking about splitting the XMLs somehow. – GranDurismo Dec 28 '18 at 16:50
  • Update: February 2018 (130k records) took 19 minutes using `xmlToDataFrame`. That's very encouraging! Thank you very much! I'll test a few "heavier" months and report back. – GranDurismo Dec 28 '18 at 17:08
  • Good to hear. Yes, the unfortunate aspect is that for XML files you have to load the entire document into memory before parsing. Close all other apps before running in R. If you process one month, try destroying the helper objects with `rm(doc, dec_df_list)` and freeing resources with `gc()`, or [save the data frame to disk](https://stackoverflow.com/a/19967638/1422451) and restart the R session. – Parfait Dec 28 '18 at 20:53
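
A minimal sketch of the memory-management suggestion in the comment above (the .rds path is a placeholder; it assumes the XML package is loaded and doc is parsed as in the answer):

library(XML)

# Parse one month, convert, and persist the result to disk
doc <- xmlParse("/path/to/file.xml")
df  <- xmlToDataFrame(doc, nodes = getNodeSet(doc, "//ROW_DUASDIA"))
saveRDS(df, "/path/to/dming_month.rds")   # save data frame to disk

rm(doc, df)   # destroy the large helper objects
gc()          # release the freed memory

# In a fresh R session later:
df <- readRDS("/path/to/dming_month.rds")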

Here is a solution that processes the entire document at once, as opposed to reading each of the 150,000 records in a loop. This should provide a significant performance boost.

This version can also handle cases where the number of variables per record is different.

library(xml2)
doc <- read_xml('<ROOT>
  <ROWSET_DUASDIA>
    <ROW_DUASDIA NUM="1">
      <variable1>value1</variable1>
      <variable191>value2</variable191>
    </ROW_DUASDIA>
    <ROW_DUASDIA NUM="150236">
      <variable1>value3</variable1>
      <variable2>value_new</variable2>
      <variable191>value4</variable191>
    </ROW_DUASDIA>
  </ROWSET_DUASDIA>
</ROOT>')

# Find all of the nodes/records
nodes <- xml_find_all(doc, ".//ROW_DUASDIA")

# Find the record NUM and the number of variables under each record
nodenum <- xml_attr(nodes, "NUM")
nodeslength <- xml_length(nodes)

# Find the variable names and values
nodenames <- xml_name(xml_children(nodes))
nodevalues <- trimws(xml_text(xml_children(nodes)))

# Create dataframe
df <- data.frame(NUM = rep(nodenum, times = nodeslength),
                 variable = nodenames, values = nodevalues,
                 stringsAsFactors = FALSE)

# The dataframe is in a long format.
# Use dcast from reshape2, or spread from tidyr, to convert to wide format.
#      NUM    variable    values
# 1      1   variable1    value1
# 2      1 variable191    value2
# 3 150236   variable1    value3
# 4 150236   variable2 value_new
# 5 150236 variable191    value4

#Convert to wide format
library(tidyr)
spread(df, variable, values)
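
As a side note: in later versions of tidyr (1.0.0 and up), spread() is superseded by pivot_wider(); an equivalent call would be:

library(tidyr)

# Wide format with the newer tidyr interface
pivot_wider(df, names_from = variable, values_from = values)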
Dave2e
  • Oh wow, that looks a lot simpler. However, it's spent the last hour or so in the nodenames part. No errors, just still processing. – GranDurismo Dec 27 '18 at 20:05
  • My bad, I missed that. It got stuck on processing the nodenames and values. I'll try on my desktop when I can, or I'll have to process every daily file and then combine them. – GranDurismo Dec 27 '18 at 22:48
  • You only have to process one of the nodes to get the names, given that the names are the same for all the nodes. Use the following to get the names: `nodenames <- xml_name(xml_children(nodes[1]))`. – Edward Carney Dec 28 '18 at 02:36
  • @EdwardCarney Yes, I did just that, but I do need to get the values for all the records and that's very slow to process as well – GranDurismo Dec 28 '18 at 03:29