I have a few large XML files (~10 GB and growing every week) which I need to convert to a data.frame in R for analysis. The structure of the XML is as follows (with multiple records and a few more field elements per record):
<recordGroup>
  <records>
    <record>
      <recordId>123442</recordId>
      <reportingCountry>PT</reportingCountry>
      <date>2020-02-20</date>
      <field>
        <fieldName>Gender</fieldName>
        <fieldValue>F</fieldValue>
      </field>
      <field>
        <fieldName>Age</fieldName>
        <fieldValue>57</fieldValue>
      </field>
      <field>
        <fieldName>ClinicalSymptoms</fieldName>
        <fieldValue>COUGH</fieldValue>
        <fieldValue>FEVER</fieldValue>
        <fieldValue>O</fieldValue>
        <fieldValue>RUNOS</fieldValue>
        <fieldValue>SBREATH</fieldValue>
      </field>
    </record>
  </records>
</recordGroup>
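The output I am after is a long data.frame with one row per fieldValue, something along these lines (purely illustrative; the column names don't matter):

  recordId        fieldName fieldValue
1   123442           Gender          F
2   123442              Age         57
3   123442 ClinicalSymptoms      COUGH
4   123442 ClinicalSymptoms      FEVER
...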
I have been trying to find the most efficient way of extracting the data and converting it to a data.frame. One major challenge is that the files are quite large: both XML and xml2 run into memory problems, apart from taking hours to process. My current strategy is to use xmlEventParse with the code below, but this seems to be even more inefficient.
library(XML)

# Accumulator: one row per fieldValue (recordId, fieldName, value)
value_df <- data.frame(recordId = character(), fieldName = character(),
                       value = character(), stringsAsFactors = FALSE)
nvar    <- 0   # row counter
tagName <- ""  # name of the element currently being read
rec     <- ""  # recordId of the current record
var_f   <- ""  # fieldName of the current field

xmlEventParse(
  xmlDoc_clean,
  list(
    # Remember which element we are currently inside, so the text handler
    # knows what the character data belongs to.
    startElement = function(name, attrs) {
      tagName <<- name
    },
    # Character data: keep the current recordId / fieldName, and append a
    # row to value_df for every fieldValue encountered.
    text = function(x) {
      if (nchar(x) > 0) {
        if (tagName == "recordId") {
          rec <<- x
        } else if (tagName == "fieldName") {
          var_f <<- x
        } else if (tagName == "fieldValue") {
          nvar <<- nvar + 1
          value_df[nvar, 1:3] <<- c(rec, var_f, x)  # grows the data.frame row by row
        }
      }
    },
    # Print a running row count at the end of each record
    endElement = function(name) {
      if (name == "record") {
        print(nvar)
      }
    }
  )
)
I have tried xml2 (memory issues), XML (memory issues as well with standard DOM parsing), and I was also going to try XMLSchema but didn't manage to get it to work. Both XML and xml2 work if the files are split up.
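For reference, what I do with xml2 on a split-up piece looks roughly like the sketch below (simplified; "chunk_001.xml" is just a placeholder for one of the smaller files):

library(xml2)

# Read one split-up piece that fits in memory
doc     <- read_xml("chunk_001.xml")   # placeholder file name
records <- xml_find_all(doc, ".//record")

# One row per fieldValue: recordId, fieldName, fieldValue
value_df <- do.call(rbind, lapply(records, function(rec) {
  id     <- xml_text(xml_find_first(rec, "./recordId"))
  fields <- xml_find_all(rec, "./field")
  do.call(rbind, lapply(fields, function(f) {
    data.frame(
      recordId   = id,
      fieldName  = xml_text(xml_find_first(f, "./fieldName")),
      fieldValue = xml_text(xml_find_all(f, "./fieldValue")),  # a field can have several values
      stringsAsFactors = FALSE
    )
  }))
}))

This works fine on the smaller pieces but runs out of memory on the full files.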
I would appreciate any guidance on improving efficiency, as the files I am working with are becoming larger every week. I am using R on a Linux machine.