
I have a very large XML file (>70 GB) from which I only need to read a few segments. However, I don't know the structure of the file, and my attempts to extract it failed because of the file's size.

I don't need to read the whole file or convert it to a data frame, only to extract specific parts. Without the structure, though, I don't know what format those parts take.
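One way to learn the structure without loading the file might be to stream it once and tally the element names. The sketch below assumes the XML package's SAX interface (xmlEventParse with a generic startElement handler). count_elements and the demonstration file are hypothetical, and a full pass over a file this large could still take a long time.

```r
library(XML)

# Hypothetical helper: tally element names in a single streaming pass.
# No document tree is built, so memory use stays small.
count_elements <- function(path) {
  counts <- integer(0)
  handlers <- list(
    # Called for every opening tag; `name` is the element name
    startElement = function(name, attrs, ...) {
      if (is.na(counts[name])) {
        counts[name] <<- 1L
      } else {
        counts[name] <<- counts[name] + 1L
      }
    }
  )
  xmlEventParse(path, handlers = handlers)
  sort(counts, decreasing = TRUE)
}

# Small demonstration on a temporary file with made-up content
demo_file <- tempfile(fileext = ".xml")
writeLines(
  "<root><ROW><a>1</a><b>2</b></ROW><ROW><a>3</a><b>4</b></ROW></root>",
  demo_file
)
element_counts <- count_elements(demo_file)
print(element_counts)
```

The names that appear most often are usually the repeating record elements, which are the candidates for a branch handler.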

I tried xmlParse, and also xmlEventParse, following the approach suggested in "How to read large (~20 GB) xml file in R?"

The code suggested there returns an empty data frame:

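library(XML)

xmlDoc <- "Final.xml"
result <- NULL

# Branch handlers to use with xmlEventParse
row.sax <- function() {
    # Called once for each complete <ROW> element; the list name below
    # has to match an element name that occurs in the file
    ROW <- function(node) {
        children <- xmlChildren(node)
        # Drop text nodes that sit directly under the element (typically whitespace)
        children[which(names(children) == "text")] <- NULL
        # Append the child values as a new row of the global result
        result <<- rbind(result, sapply(children, xmlValue))
    }
    branches <- list(ROW = ROW)
    return(branches)
}

# Call xmlEventParse
xmlEventParse(xmlDoc, handlers = list(), branches = row.sax(),
              saxVersion = 2, trim = FALSE)

# Convert the collected rows to a data frame
result <- as.data.frame(result, stringsAsFactors = FALSE)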

I have little experience working with XML, and so I don't fully understand the solution I tried to use.
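As far as the XML package documentation describes it, each entry in branches is called with a fully built node whenever an element with that entry's name is encountered, so the name has to match an element that really occurs in the file. A minimal sketch under that assumption follows. read_records, the tag name "record", and the demonstration file are all hypothetical.

```r
library(XML)

# Hypothetical helper: collect the child values of every element called
# `record_tag`. The tag is a placeholder and must match the real file.
read_records <- function(path, record_tag) {
  rows <- list()
  handler <- function(node) {
    children <- xmlChildren(node)
    # Keep element children only; direct text nodes are named "text"
    children <- children[names(children) != "text"]
    rows[[length(rows) + 1L]] <<- sapply(children, xmlValue)
  }
  # The list name decides which elements trigger the handler
  branches <- setNames(list(handler), record_tag)
  xmlEventParse(path, handlers = list(), branches = branches,
                saxVersion = 2, trim = FALSE)
  rows
}

# Demonstration file whose records are called <record>, not <ROW>
demo_file <- tempfile(fileext = ".xml")
writeLines(
  "<root><record><a>1</a><b>2</b></record><record><a>3</a><b>4</b></record></root>",
  demo_file
)
matched <- read_records(demo_file, "record")
unmatched <- read_records(demo_file, "ROW")
print(length(matched))
print(length(unmatched))
```

In this sketch a tag name that does not occur in the file yields no rows at all, which would be consistent with an empty data frame. Collecting every record in memory may also be impractical for a 70 GB file, so filtering inside the handler is probably necessary.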

Thanks for your help!

yarbaur