I have an XML file, Tags.xml, structured like so:
<?xml version="1.0" encoding="utf-8"?>
<tags>
<row Id="1" TagName=".net" Count="261481" ExcerptPostId="3624959" WikiPostId="3607476" />
<row Id="2" TagName="html" Count="710104" ExcerptPostId="3673183" />
<row Id="3" TagName="javascript" Count="1519901" ExcerptPostId="3624960" WikiPostId="3607052" />
...
</tags>
It's relatively well structured, except that some rows are missing attributes (e.g. row 2 above is missing WikiPostId). I can convert the data into a data.table (or data.frame) with the following code:
library(XML)
library(data.table)
# Read XML
tagsXML <- xmlParse("Tags.xml")
# Convert to List
tagsList <- xmlToList(tagsXML)
# Each List element is a character vector. Convert each of these into a data.table
tagsList <- lapply(tagsList, function(x) as.data.table(as.list(x)))
# Rbind all the 1-row data.tables into a single data.table
tags <- rbindlist(tagsList, use.names = TRUE, fill = TRUE)
This works but seems unnecessarily slow. Is there a faster way to do this, given the well-structured nature of my data? I tried using XPath as recommended in this answer, but was unsuccessful. For example, if I try to extract the Id values in each row with tagsXML[["(//tags/@Id)"]], I get an error.
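For reference, here is a minimal sketch of what the XPath route might look like. It assumes the attributes are selected from the row elements (Id sits on row, not on tags) and uses a small inline sample in place of Tags.xml. The names sampleXML, doc, ids, wikiIds and sketch are illustrative only, and I have not benchmarked this against the xmlToList approach.

```r
library(XML)
library(data.table)

# Inline sample mirroring the structure of Tags.xml (assumption: stands in for the real file)
sampleXML <- '<tags>
<row Id="1" TagName=".net" Count="261481" ExcerptPostId="3624959" WikiPostId="3607476" />
<row Id="2" TagName="html" Count="710104" ExcerptPostId="3673183" />
<row Id="3" TagName="javascript" Count="1519901" ExcerptPostId="3624960" WikiPostId="3607052" />
</tags>'
doc <- xmlParse(sampleXML, asText = TRUE)

# Visit every <row> node and pull one attribute from each
ids <- xpathSApply(doc, "//row", xmlGetAttr, "Id")

# Pass a default so rows lacking the attribute yield NA instead of NULL
wikiIds <- xpathSApply(doc, "//row", xmlGetAttr, "WikiPostId",
                       default = NA_character_)

# Assemble the extracted columns; values are character and may need converting
sketch <- data.table(Id = ids, WikiPostId = wikiIds)
```

Extracting one attribute per call keeps each column a plain vector, so no per-row data.table is built; whether that is actually faster on the full file would need measuring.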