
As an example, I have the following XML:

tt <- '<Nummeraanduiding>
  <identificatie>0010200000114849</identificatie>
  <aanduidingRecordInactief>N</aanduidingRecordInactief>
  <aanduidingRecordCorrectie>0</aanduidingRecordCorrectie>
  <huisnummer>13</huisnummer>
  <officieel>N</officieel>
  <postcode>9904PC</postcode>
  <tijdvakgeldigheid>
    <begindatumTijdvakGeldigheid>2010051100000000</begindatumTijdvakGeldigheid>
  </tijdvakgeldigheid>
  <inOnderzoek>N</inOnderzoek>
  <typeAdresseerbaarObject>Verblijfsobject</typeAdresseerbaarObject>
  <bron>
    <documentdatum>20100511</documentdatum>
    <documentnummer>2010/NR002F</documentnummer>
  </bron>
  <nummeraanduidingStatus>Naamgeving uitgegeven</nummeraanduidingStatus>
  <gerelateerdeOpenbareRuimte>
    <identificatie>0010300000000444</identificatie>
  </gerelateerdeOpenbareRuimte>
</Nummeraanduiding>'

The goal is to convert this node (Nummeraanduiding) to a data.table (a data.frame is also fine). One challenge is that I have a lot of these Nummeraanduiding nodes (millions of them).

The following code is able to process the data:

library(XML)
# Parse the document
doc <- xmlParse(tt)
# Solution (1) - the most obvious approach
XML::xmlToDataFrame(doc)
# Solution (2) - convert to a list and flatten it into a named vector
unlist(xmlToList(doc))
# Solution (3) - my own solution: flatten first, then build the data.frame
data.frame(as.list(unlist(xmlToList(doc))))

Not all solutions produce the desired result; in the end, only Solution (3) satisfies my needs:

  1. It is in a data.frame/data.table format
  2. It contains all the child-child-nodes and has distinct names for each column
  3. It does not 'merge' the information of child-child-nodes

However, running this piece of code over all my data becomes quite slow: it took over 8 hours to process a file containing 2,290,000 Nummeraanduiding nodes.
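Roughly, applying Solution (3) per node over the whole file looks like this (a sketch; the file name is a placeholder):

library(XML)
big <- xmlParse("nummeraanduidingen.xml")  # placeholder file name
nodes <- getNodeSet(big, "//Nummeraanduiding")
# One single-row data.frame per node, then stack them all
rows <- lapply(nodes, function(n) data.frame(as.list(unlist(xmlToList(n)))))
result <- do.call(rbind, rows)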

Does anyone know a way to speed up this process? Can my method be improved? Am I missing some useful function?

zwep
  • You could do it with xml2, e.g. `library(xml2) ; tt %>% read_xml() %>% xml_children() %>% {setNames(xml_text(.), xml_name(.))} %>% t() %>% data.frame()`, though I'd really probably leave off `t() %>% data.frame` and just `rbind` the vectors into a matrix that can be converted to a data.frame. If you want to parse the whole list element-wise, `purrr` can be very useful. – alistaire Nov 01 '16 at 06:56
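A runnable version of that xml2 approach, as a sketch. Note that xml_text() on a nested node such as bron concatenates the text of all its descendants, so grandchild values are merged and requirement 3 above is not met:

library(xml2)
node <- read_xml(tt)
kids <- xml_children(node)
# Named vector of child name -> text, then a one-row data.frame
vals <- setNames(xml_text(kids), xml_name(kids))
data.frame(t(vals), stringsAsFactors = FALSE)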

1 Answer


Given that each field is already on a separate line, just grep the field lines out, read what is left using read.table, and convert from long to wide using tapply to produce the resulting matrix (which can be converted to a data.frame or data.table if desired). Note that in read.table we bypass quote, comment and class processing. Finally, test it out to see whether it is faster. No packages are used.

nms <- c("identificatie", "aanduidingRecordInactief", "aanduidingRecordCorrectie", 
"huisnummer", "officieel", "postcode", "tijdvakgeldigheid.begindatumTijdvakGeldigheid", 
"inOnderzoek", "typeAdresseerbaarObject", "bron.documentdatum", 
"bron.documentnummer", "nummeraanduidingStatus", 
"gerelateerdeOpenbareRuimte.identificatie")

rx <- paste(nms, collapse = "|")
g <- chartr("<", ">", grep(rx, readLines(textConnection(tt)), value = TRUE))

long <- read.table(text = g, sep = ">", quote = "", comment.char = "", 
  colClasses = "character")[2:3]
names(long) <- c("field", "value")
long$field <- factor(long$field, levels = nms)  # maintain order of columns
long$recno <- cumsum(long$field == "identificatie")

with(long, tapply(value, list(recno, field), c))
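If a data.frame or data.table is wanted rather than a matrix, the conversion is direct. A minimal sketch (the last line assumes the data.table package is installed):

m <- with(long, tapply(value, list(recno, field), c))
df <- as.data.frame(m, stringsAsFactors = FALSE)  # plain data.frame
# dt <- data.table::as.data.table(m)              # or a data.table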

If all records have exactly the same set of fields, in the order listed in nms (which the positional field labelling above already assumes), then the tapply line could be replaced with this, which is likely faster:

matrix(long$value, ncol = length(nms), byrow = TRUE, dimnames = list(NULL, nms))

Another alternative to the tapply line would be to use reshape from base R or to use dcast from the reshape2 package.
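For instance, the reshape2 variant might look like this (a sketch, reusing the long data frame built above):

library(reshape2)
# One row per recno, one column per field, ordered by the factor levels
wide <- dcast(long, recno ~ field, value.var = "value")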

G. Grothendieck