Quickly read xml file and convert to data frame

Question

I have an XML file Tags.xml structured like so

<?xml version="1.0" encoding="utf-8"?>
<tags>
  <row Id="1" TagName=".net" Count="261481" ExcerptPostId="3624959" WikiPostId="3607476" />
  <row Id="2" TagName="html" Count="710104" ExcerptPostId="3673183" />
  <row Id="3" TagName="javascript" Count="1519901" ExcerptPostId="3624960" WikiPostId="3607052" />
...
</tags>

It's relatively well structured except that some rows are missing attributes (e.g. row 2 above is missing WikiPostId). I can convert the data into a data.table (or data.frame) with the following code

library(XML)
library(data.table)

# Read XML
tagsXML <- xmlParse("Tags.xml")

# Convert to List
tagsList <- xmlToList(tagsXML)

# Each List element is a character vector.  Convert each of these into a data.table
tagsList <- lapply(tagsList, function(x) as.data.table(as.list(x)))

# Rbind all the 1-row data.tables into a single data.table
tags <- rbindlist(tagsList, use.names = TRUE, fill = TRUE)

This works but seems unnecessarily slow. Is there a faster way to do this, given the well structured nature of my data? I tried using xpath as recommended in this answer, but was unsuccessful. For example, if I try to extract the Id values in each row with tagsXML[["(//tags/@Id)"]] I get an error.

score 2 · Accepted Answer · answered Jan 19 '18 at 21:49


For attribute-centric XML, consider the undocumented internal function xmlAttrsToDataFrame, which is not exported and so must be called with the ::: operator:

df <- XML:::xmlAttrsToDataFrame(getNodeSet(tagsXML, path='//row'))

#   Id    TagName   Count ExcerptPostId WikiPostId
# 1  1       .net  261481       3624959    3607476
# 2  2       html  710104       3673183       <NA>
# 3  3 javascript 1519901       3624960    3607052
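Because xmlAttrsToDataFrame is internal and could change between package versions, the same result can also be sketched with exported functions only. The block below is a minimal, hedged example rather than the answerer's method: it assumes the XML and data.table packages are installed, uses an inline copy of the three sample rows in place of Tags.xml, and the names xml_text, doc, row_attrs and ids are purely illustrative. It also shows that Id is an attribute of row, not of tags, so the XPath needs to be //row/@Id.

```r
library(XML)
library(data.table)

# Inline copy of the sample rows from the question (stands in for Tags.xml)
xml_text <- '<?xml version="1.0" encoding="utf-8"?>
<tags>
  <row Id="1" TagName=".net" Count="261481" ExcerptPostId="3624959" WikiPostId="3607476" />
  <row Id="2" TagName="html" Count="710104" ExcerptPostId="3673183" />
  <row Id="3" TagName="javascript" Count="1519901" ExcerptPostId="3624960" WikiPostId="3607052" />
</tags>'

doc <- xmlParse(xml_text, asText = TRUE)

# One named character vector of attributes per <row> node
row_attrs <- xpathApply(doc, "//row", xmlAttrs)

# Bind the rows; fill = TRUE inserts NA where an attribute is missing
tags <- rbindlist(lapply(row_attrs, as.list), fill = TRUE)

# A single attribute can be pulled directly with XPath on the row nodes
ids <- as.character(xpathSApply(doc, "//row/@Id"))
```

All columns come back as character, so numeric fields such as Count would still need an explicit as.integer() conversion.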


Parfait


This is great! Is there any way to specifically exclude certain tags? For example, suppose ExcerptPostId has a bunch of data I don't need. What's the best way to exclude it? – Ben Jan 19 '18 at 22:02
Simply remove the column from the final data frame. Normally, in XPath you would select `//row/@*[name()!='ExcerptPostId']`, but that does not work with `getNodeSet()`. – Parfait Jan 19 '18 at 22:16
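Dropping the column after the fact, as suggested in the comment above, could look like the sketch below. It uses base R only; the data frame is built by hand to mirror the output shown in the answer (an assumption made so the example runs without Tags.xml), and the name df is illustrative.

```r
# Hand-built stand-in for the data frame produced in the answer
df <- data.frame(
  Id            = c("1", "2", "3"),
  TagName       = c(".net", "html", "javascript"),
  Count         = c("261481", "710104", "1519901"),
  ExcerptPostId = c("3624959", "3673183", "3624960"),
  WikiPostId    = c("3607476", NA, "3607052"),
  stringsAsFactors = FALSE
)

# Drop the unwanted column by name
df$ExcerptPostId <- NULL

# To drop several columns at once, setdiff() on the names also works, e.g.
# df <- df[, setdiff(names(df), c("ExcerptPostId", "WikiPostId")), drop = FALSE]
```

With a data.table instead of a data.frame, the equivalent in-place form would be tags[, ExcerptPostId := NULL].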
