
Building on this previous question (see here), I am attempting to read in many large XML files via xmlEventParse whilst saving node-varying data. I am working with this sample XML: https://www.nlm.nih.gov/databases/dtd/medsamp2015.xml.

The code below uses xpathSApply to extract the necessary values and a series of if statements to combine the values in a way that matches the unique value (PMID) to each of the non-unique values (LastName) within a record, of which there may be none. The goal is to write a series of small CSVs along the way (here, after every 1000 LastNames) to minimize the amount of memory used.
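To make the matching concrete, here is a toy illustration of what the rep/cbind combination produces for a single record (standalone values, not part of the pipeline):

v1 <- "12345"                      # the record's unique PMID
v2 <- c("Smith", "Jones")          # zero or more LastNames in the record
cbind(c(rep(v1, length(v2))), v2)  # one row per author: "12345" paired with each name
# if v2 is empty, rep(v1, 0) is empty too, so the record contributes no rows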

When run on the full-sized data set, the code successfully outputs files in batches; however, something is still being stored in memory that eventually causes a system error once all RAM is used. I've watched the task manager while the code runs and can see R's memory grow as the program progresses. If I stop the program mid-run and then clear the R workspace, including hidden items, the memory still appears to be in use by R. It is not until I shut down R that the memory is freed again.
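(Concretely, "clearing the workspace, including hidden items" means something like the following, and R's reported memory usage stays high even after it:)

rm(list = ls(all.names = TRUE))  # remove everything, including hidden (dot-prefixed) objects
gc()                             # force a garbage collection; usage is still not released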

Run this a few times yourself and you'll see R's memory usage grow even after clearing the workspace.

Please help! This problem appears to be common to others reading in large XML files in this manner (see, for example, the comments on this question).

My code is as follows:

library(XML)

filename <- "~/Desktop/medsamp2015.xml"

tempdat <- data.frame(pmid=as.numeric(),
                      lname=character(), 
                      stringsAsFactors=FALSE) 
cnt <- 1
branchFunction <- function() {
  func <- function(x, ...) {
    v1 <- xpathSApply(x, path = "//PMID", xmlValue)
    v2 <- xpathSApply(x, path = "//Author/LastName", xmlValue)
    print(cbind(c(rep(v1,length(v2))), v2))

    #below is where I store/write the temp data along the way
    #but even without doing this, memory is used (even after clearing)

    tempdat <<- rbind(tempdat,cbind(c(rep(v1,length(v2))), v2))
    if (nrow(tempdat) > 1000){
      outname <- paste0("~/Desktop/outfiles",cnt,".csv")
      write.csv(tempdat, outname, row.names = FALSE)
      tempdat <<- data.frame(pmid=as.numeric(),
                            lname=character(), 
                            stringsAsFactors=FALSE)
      cnt <<- cnt+1
    }
  }
  list(MedlineCitation = func)
}

myfunctions <- branchFunction()

#RUN
xmlEventParse(
  file = filename, 
  handlers = NULL, 
  branches = myfunctions
)
km5041
  • Just updated the code so it should run after downloading: https://www.nlm.nih.gov/databases/dtd/medsamp2015.xml. Memory issue persists. – km5041 Nov 09 '17 at 17:16
  • What OS are you running on? Windows, L*nux, Mac OSX? – Technophobe01 Nov 10 '17 at 01:52
  • Maybe you could use something like tempdat[i] <- cbind(c(rep(v1,length(v2))), v2) instead of rbind. I tried to implement it myself, but I can't understand how xmlEventParse works. – Alex Nov 10 '17 at 06:45
  • @Technophobe01 I get this problem whether I'm on Windows or Mac, which makes me think it has to do with xmlEventParse storing something in the memory that is hidden and non-erasable without an R shutdown – km5041 Nov 10 '17 at 11:50
  • @km5041 The library you are using is likely written in C or C++ and is allocating memory outside the control of your R session. Thus, the memory leak is outside the R session but affects it. The way I navigate this type of problem is to partition the analysis across Rscript instances. Break the work up. See: https://stackoverflow.com/questions/37264919/r-memory-not-released-in-windows/46800450#46800450 – Technophobe01 Nov 10 '17 at 15:08
  • @Technophobe01 thanks for the insight. As I already have downloaded my set of XML files, it sounds like the implication for me is to break them up into smaller files and parse individually. Any tips on how to efficiently split XML while preserving the structure? (Should I ask a new question?) – km5041 Nov 10 '17 at 16:56
  • @km5041 How big is each individual file? Can you point me to the source files? I was going to provide an example answer. – Technophobe01 Nov 10 '17 at 16:59
  • The source files are anywhere from 1-20GB. They are confidential, so I can't share them, but they very much resemble the medsamp2015.xml file linked in the question. You could just copy that file a few times to mimic the real inputs. – km5041 Nov 10 '17 at 17:41
  • I would split them up and process each one individually - see the code example. – Technophobe01 Nov 10 '17 at 19:25

1 Answer


Here is an example: we have a launch script, invoke.sh, that calls an R script and passes the URL and file name as parameters. In this case, I had previously downloaded the test file medsamp2015.xml and put it in the ./data directory.

  • My sense would be to create a loop in the invoke.sh script and iterate through the list of target file names. For each file you invoke an R instance, download it, process the file, and move on to the next, as sketched below.
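For instance, the same partitioning can be driven from a controller R session instead of bash. This is a minimal sketch, assuming Rscript is on the PATH, 47162861.R sits in the working directory, and the file list is illustrative:

baseUrl   <- "https://www.nlm.nih.gov/databases/dtd"
fileNames <- c("medsamp2015.xml")    # extend with your real target file names

for (fn in fileNames) {
  # Each file is parsed in a fresh R process, so memory held by the XML
  # package's C code is returned to the OS when that process exits.
  status <- system2("Rscript", args = c("47162861.R", baseUrl, fn))
  if (status != 0L) warning("Run failed for: ", fn)
}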

Caveat: I didn't check or change your function against any other download files or formats. I would turn off the printing of the output by removing the print() wrapper in the branch function:

print( cbind(c(rep(v1, length(v2))), v2))
  • See runtime.txt for the printed output.
  • The output .csv files are placed in the ./data directory.
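If you still want some progress feedback, one option (my addition, not in the original code) is to log just each batch boundary, placed right after the write.csv() call and before tempdat is reset:

message("Wrote batch ", cnt, " (", nrow(tempdat), " rows) to ", outname)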

Note: This is a derivative of a previous answer provided by me on this subject: R memory not released in Windows. I hope it helps by way of example.

Launch Script

#!/usr/local/bin/bash -x

R --no-save -q --slave < ./47162861.R --args "https://www.nlm.nih.gov/databases/dtd" "medsamp2015.xml"

R File - 47162861.R

# Set working directory

projectDir <- "~/dev/stackoverflow/47162861"
setwd(projectDir)

# -----------------------------------------------------------------------------
# Load required Packages...
requiredPackages <- c("XML")

# ipak: install any packages that are missing, then load them all
ipak <- function(pkg) {
  new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
  if (length(new.pkg))
    install.packages(new.pkg, dependencies = TRUE)
  sapply(pkg, require, character.only = TRUE)
}

ipak(requiredPackages)

# -----------------------------------------------------------------------------
# Load required Files
# trailingOnly=TRUE means that only your arguments are returned
args <- commandArgs(trailingOnly = TRUE)

if ( length(args) != 0 ) {
  dataDir <- file.path(projectDir,"data")
  fileUrl = args[1]
  fileName = args[2]
} else {
  dataDir <- file.path(projectDir,"data")
  fileUrl <- "https://www.nlm.nih.gov/databases/dtd"
  fileName <- "medsamp2015.xml"
}

# -----------------------------------------------------------------------------
# Download file

# Does the directory exist? If it doesn't, create it
if (!file.exists(dataDir)) {
  dir.create(dataDir)
}

# Check whether the data has already been downloaded; if not, download it

if (!file.exists(file.path(dataDir, fileName))) {
  # fileUrl is the base URL, so the file name must be appended to it
  # (method = "wget" requires wget to be installed on the system)
  download.file(file.path(fileUrl, fileName),
                file.path(dataDir, fileName),
                method = "wget")
}

# -----------------------------------------------------------------------------
# Now we extract the data

tempdat <- data.frame(pmid = as.numeric(), lname = character(),
  stringsAsFactors = FALSE)
cnt <- 1

branchFunction <- function() {
  func <- function(x, ...) {
    v1 <- xpathSApply(x, path = "//PMID", xmlValue)
    v2 <- xpathSApply(x, path = "//Author/LastName", xmlValue)
    print(cbind(c(rep(v1, length(v2))), v2))

    # below is where I store/write the temp data along the way
    # but even without doing this, memory is used (even after
    # clearing)

    tempdat <<- rbind(tempdat, cbind(c(rep(v1, length(v2))),
      v2))
    if (nrow(tempdat) > 1000) {
      outname <- file.path(dataDir, paste0(cnt, ".csv")) # Create FileName
      write.csv(tempdat, outname, row.names = FALSE) # Write file to the data directory
      tempdat <<- data.frame(pmid = as.numeric(), lname = character(),
        stringsAsFactors = FALSE)
      cnt <<- cnt + 1
    }
  }
  list(MedlineCitation = func)
}

myfunctions <- branchFunction()

# -----------------------------------------------------------------------------
# RUN
xmlEventParse(file = file.path(dataDir, fileName),
              handlers = NULL,
              branches = myfunctions)

Test File and Output

~/dev/stackoverflow/47162861/data/medsamp2015.xml

$ ll                                                            
total 2128
drwxr-xr-x@ 7 hidden  staff   238B Nov 10 11:05 .
drwxr-xr-x@ 9 hidden  staff   306B Nov 10 11:11 ..
-rw-r--r--@ 1 hidden  staff    32K Nov 10 11:12 1.csv
-rw-r--r--@ 1 hidden  staff    20K Nov 10 11:12 2.csv
-rw-r--r--@ 1 hidden  staff    23K Nov 10 11:12 3.csv
-rw-r--r--@ 1 hidden  staff    37K Nov 10 11:12 4.csv
-rw-r--r--@ 1 hidden  staff   942K Nov 10 11:05 medsamp2015.xml

Runtime Output

> ./invoke.sh > runtime.txt
+ R --no-save -q --slave --args https://www.nlm.nih.gov/databases/dtd medsamp2015.xml
Loading required package: XML

File: runtime.txt

Technophobe01