
I have a few large XML files (~10 GB, and growing every week) which I need to convert to a data.frame in R for analysis. The structure of the XML is as follows (with multiple records and a few more field elements per record):

<recordGroup>
  <records>
    <record>
      <recordId>123442</recordId>
      <reportingCountry>PT</reportingCountry>
      <date>2020-02-20</date>
      <field>
        <fieldName>Gender</fieldName>
        <fieldValue>F</fieldValue>
      </field>
      <field>
        <fieldName>Age</fieldName>
        <fieldValue>57</fieldValue>
      </field>
      <field>
        <fieldName>ClinicalSymptoms</fieldName>
        <fieldValue>COUGH</fieldValue>
        <fieldValue>FEVER</fieldValue>
        <fieldValue>O</fieldValue>
        <fieldValue>RUNOS</fieldValue>
        <fieldValue>SBREATH</fieldValue>
      </field>
    </record>
  </records>
</recordGroup>

I have been trying to find the most efficient way of extracting the data and converting it to a data.frame. One major challenge is that the files are quite large, and both XML and xml2 run into memory problems, apart from taking hours to process. My current strategy is to use xmlEventParse with the code below, but this seems to be even more inefficient.

library(XML)

# globals updated by the SAX handlers
value_df <- data.frame(recordId = character(), vardf = character(), value = character())
nvar     <- 0
tagName  <- ""
rec      <- NA_character_
var_f    <- NA_character_

xmlEventParse(xmlDoc_clean,
              list(
                startElement = function(name, attrs) {
                  tagName <<- name
                },
                text = function(x) {
                  if (nchar(x) > 0) {
                    if (tagName == "recordId") {
                      rec <<- x
                    } else if (tagName == "fieldName") {
                      var_f <<- x
                    } else if (tagName == "fieldValue") {
                      # append one row per fieldValue (grows value_df row by row)
                      nvar <<- nvar + 1
                      value_df[nvar, 1:3] <<- c(rec, var_f, x)
                    }
                  }
                },
                endElement = function(name) {
                  if (name == "record") {
                    print(nvar)
                  }
                }
              ))

I have tried xml2 (memory issues), XML (memory issues as well with the standard DOM parsing), and was also going to try XMLSchema but didn't manage to get it to work. Both XML and xml2 work if the files are split up.

Would appreciate any guidance on improving efficiency, as the files I am working with are becoming larger every week. I am using R on a Linux machine.

Gakku
  • Did you see the update in the answer here? https://stackoverflow.com/questions/33446888/r-convert-xml-data-to-data-frame – ibilgen Dec 29 '20 at 20:40
  • @ibilgen Thanks, yes - I have tried similar approaches, but I am trying to find the most efficient way, as my datasets are really large and can take a couple of hours to parse with xml2. I haven't yet figured out how to use XSLT, which is mentioned in another answer to that question. – Gakku Dec 29 '20 at 21:07
  • See [R: xmlEventParse with Large, Varying-node XML Input and Conversion to Data Frame](https://stackoverflow.com/q/46187461/1422451). The accepted solution uses `xmlEventParse`. – Parfait Dec 29 '20 at 21:33
  • @Parfait the approach using branches seems to work as well and looks slightly faster than using handlers. I'll keep working on it and post the code; hopefully it might be useful for others. – Gakku Dec 30 '20 at 16:11

2 Answers


When memory is a challenge, consider hard disk. Specifically, consider building a large CSV version of the parsed XML data with iterative append calls via write.table inside an xmlEventParse run:

library(XML)

# INITIALIZE CSV WITH HEADER ROW
csv <- file.path("C:", "Path", "To", "Large.csv")
fileConn <- file(csv); writeLines("id,tag,text", fileConn); close(fileConn)

i <- 0
doc <- file.path("C:", "Path", "To", "Large.xml")
output <- xmlEventParse(doc,
                        list(startElement=function(name, attrs){
                          if(name == "recordId") {i <<- i + 1}
                          tagName <<- name
                        }, text=function(x) {
                          if(nchar(trimws(x)) > 0) {
                            # append one CSV row per non-blank text node
                            write.table(data.frame(id=i, tag=tagName, text=x), 
                                        file=csv, append=TRUE, sep=",", 
                                        row.names=FALSE, col.names=FALSE)
                          }
                        }),
                        useTagName=FALSE, addContext=FALSE)

Output

Obviously, further data wrangling will be needed for proper row/column migration, but you can now read the large CSV with any of the many tools out there, or in chunks; a reshaping sketch follows the sample output below.

id,tag,text
1,"recordId","123442"
1,"reportingCountry","PT"
1,"date","2020-02-20"
1,"fieldName","Gender"
1,"fieldValue","F"
1,"fieldName","Age"
1,"fieldValue","57"
1,"fieldName","ClinicalSymptoms"
1,"fieldValue","COUGH"
1,"fieldValue","FEVER"
1,"fieldValue","O"
1,"fieldValue","RUNOS"
1,"fieldValue","SBREATH"
Parfait
  • Thanks - I had been trying a similar approach but didn't manage. Your example works very well, but writing to CSV seems to take time: it only managed 100 or so records in a minute. I tried to convert your code to produce a dataframe, which is a bit faster but still rather slow (10,000 records in 3.5 minutes - I have 1.5 million to go through). I get the feeling that I have to accept that anything less than an hour to process is good enough! – Gakku Dec 29 '20 at 23:42
  • RAM is usually faster than disk reads, but avoid growing a large data frame in memory. This solution tries to avoid a large, growing RAM footprint. Restart the machine, avoid running other apps and IDEs like RStudio, and run this at the command line with `Rscript`. Don't store the XML or CSV on external or network drives, but on the local hard disk. – Parfait Dec 30 '20 at 00:53

In the end the fastest approach I found was the following:

  1. Split the XML files into smaller chunks using xml2. I have >100 GB of RAM on the server I am working on, so I could parallelize this step using foreach with 6 workers, but mileage varies depending on how much RAM is available.
  2. The function splitting the files returns a data.frame with the locations of the split files.
  3. Process the smaller XML files in a foreach loop - this time it is possible to use all cores, so I have gone with 12 workers. The processing uses xml2, as I found that to be the fastest way. Initially the extracted data is in a long format, but I then convert it to a wide format within the loop.
  4. The loop binds the resulting dataframes into one large dataframe. The final step is using fwrite to save the CSV file. This seems to be the most efficient way.

With this approach I can process a 2.6GB XML file in 6.5 minutes.

I will add the code eventually, but it is quite specific, so I need to generalize it a bit.
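
In the meantime, here is a rough, generalized sketch of steps 3 and 4, assuming the record structure from the question; the chunk list in split_files, the XPath expressions, the worker count and the output file name are placeholders rather than my exact code:

library(xml2)
library(data.table)
library(foreach)
library(doParallel)

cl <- makeCluster(12)          # adjust to the cores/RAM available
registerDoParallel(cl)

# split_files: data.frame returned by the splitting step, one chunk path per row
chunks <- foreach(f = split_files$file,
                  .packages = c("xml2", "data.table")) %dopar% {
  doc  <- read_xml(f)
  vals <- xml_find_all(doc, ".//record/field/fieldValue")

  # long format: one row per fieldValue, with its recordId and fieldName
  long <- data.table(
    recordId   = xml_text(xml_find_first(vals, "./ancestor::record/recordId")),
    fieldName  = xml_text(xml_find_first(vals, "./preceding-sibling::fieldName")),
    fieldValue = xml_text(vals)
  )

  # wide format: one row per record, repeated values collapsed with ";"
  dcast(long, recordId ~ fieldName, value.var = "fieldValue",
        fun.aggregate = function(x) paste(x, collapse = ";"))
}
stopCluster(cl)

final <- rbindlist(chunks, fill = TRUE)   # chunks may not share identical columns
fwrite(final, "records_wide.csv")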

Gakku
  • How did you split the XML files with xml2? I'm facing the same problems but haven't found a satisfying solution yet. – an_ja May 21 '21 at 12:48
  • @an_ja apologies, I only saw your comment now. Let's say I wanted 1000 records per file. I basically counted the number of records in the file, then created a vector with the start record I wanted for each split file (e.g. 1, 1001, 2001, 3001, etc.). Then for each value I deleted the records I did not want (so for the first batch I deleted 1001:n, for the second one 1:1000 and 2001:n, etc.), so I ended up with smaller XML files of 1000 records each. I can dig up the code, but in the end I took a different approach for my problem (using XSLT, which was just as fast even without parallelization). – Gakku Jul 26 '21 at 21:07
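
A minimal sketch of that splitting idea with xml2, assuming the record structure from the question (the chunk size, output naming and the full re-read per chunk are illustrative choices, not the original code):

library(xml2)

split_xml <- function(path, out_dir, chunk_size = 1000) {
  n      <- length(xml_find_all(read_xml(path), ".//record"))
  starts <- seq(1, n, by = chunk_size)
  out    <- character(length(starts))

  for (i in seq_along(starts)) {
    keep  <- starts[i]:min(starts[i] + chunk_size - 1, n)
    chunk <- read_xml(path)                      # fresh copy for each batch
    recs  <- xml_find_all(chunk, ".//record")
    xml_remove(recs[-keep])                      # delete the records outside this window
    out[i] <- file.path(out_dir, sprintf("chunk_%04d.xml", i))
    write_xml(chunk, out[i])
  }
  data.frame(file = out)                         # locations of the split files
}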