I'm new to R and trying to parse over 100k XML files into one CSV file. I used code from a previous question, and it works perfectly when I name each column explicitly. My XML files have too many elements to write out by hand, so I want to add all the columns to the data frame without explicitly listing the column headings. I'm using exactly the code below, except that I have more lines extracting columns than just zip code and amount.

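library(XML)   # xmlTreeParse(), xmlValue()
library(plyr)  # ldply()
setwd("LOCATION_OF_XML_FILES")
# pattern is a regular expression, not a glob, so anchor on the extension
xmlfiles <- list.files(pattern = "\\.xml$")

# Parse each file and return one row per file; ldply() row-binds the results
dat <- ldply(seq_along(xmlfiles), function(i) {
  doc <- xmlTreeParse(xmlfiles[i], useInternalNodes = TRUE)
  zipcode <- xmlValue(doc[["//ZipCode"]])     # first node matching the XPath
  amount <- xmlValue(doc[["//AwardAmount"]])
  data.frame(zip = zipcode, amount = amount)
})
write.csv(dat, "zipamount.csv", row.names = FALSE)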

1 Answer

Hopefully the xmlToDataFrame() function will do what you want. It assumes the XML document is a root node whose child nodes are a sequence of records, and that each record contains only simple elements. It then extracts those elements into a data.frame, one row per record.

Consider a sample XML document

<doc>
<record><a>1</a><b>2</b></record>
<record><a>10</a><b>20</b><c>bob</c></record>
<record><a>20</a><b>30</b></record>
</doc>

xmlToDataFrame() returns

   a  b    c
1  1  2 <NA>
2 10 20  bob
3 20 30 <NA>
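
To reproduce the output above, here is a minimal sketch that parses the sample document from a string. The variable names `xml_text`, `doc` and `df` are only illustrative, and `stringsAsFactors = FALSE` is passed so the columns come back as character vectors regardless of R version.

```r
library(XML)

# The sample document from above, held in a string (illustrative name)
xml_text <- "<doc>
<record><a>1</a><b>2</b></record>
<record><a>10</a><b>20</b><c>bob</c></record>
<record><a>20</a><b>30</b></record>
</doc>"

# Parse the string into an internal document, then flatten the records
doc <- xmlParse(xml_text, asText = TRUE)
df <- xmlToDataFrame(doc, stringsAsFactors = FALSE)

# Elements missing from a record (here <c>) are filled with NA
print(df)
```

Note that every column is read as text unless you convert it afterwards or supply `colClasses`.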
  • It would help to know the package needed to get this function, as it is not in base R. – r2evans Jan 25 '22 at 02:14
  • It is the XML package which is also where xmlTreeParse() and xmlValue() reside. That's why I didn't mention it but glad you suggested making it explicit. – duncantl Jan 25 '22 at 02:42
  • Gotcha, good point. Sorry for the noise, I didn't look at the question closely enough. (It should really be explicit *there* :-) – r2evans Jan 25 '22 at 02:47
  • Sorry, I didn't add the packages used in the code. I just added them in now. For the xmlToDataframe as I understand, the code should be: doc <- xmlToDataFrame(getNodeSet(doc, "//doc"))? – Nya Dawson Jan 25 '22 at 03:40
  • Actually, xmlToDataFrame() also understands to read directly from a file - if it is in a suitable format - so ``` lapply(xmlfiles, xmlToDataFrame) ``` is the simplest way to call this. – duncantl Jan 25 '22 at 05:51
  • I get the following error: Error in [<-.data.frame (*tmp*, I, names (nodes[[I]], value = column name , : duplicate subscripts for columns. Code I tried: setwd("LOCATION_OF_XML_FILES"); xmlfiles <- list.files(); doc <- lapply(xmlfiles, xmltoDataFrame) – Nya Dawson Jan 25 '22 at 13:55
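
Following up on the lapply() suggestion in the comments, here is a rough sketch of reading several files and stacking them into one data frame. It assumes each file has the flat record layout described in the answer; the temporary directory, the two demo files and the helper `read_one()` are hypothetical and exist only to make the example self-contained. One possible cause of the "duplicate subscripts for columns" error is a record that repeats the same child element name, so the sketch wraps each call in tryCatch() and skips files that fail rather than stopping the whole run. That is an assumption about the error, not a confirmed diagnosis of the files in the question.

```r
library(XML)
library(plyr)

# Hypothetical demo files in a temporary directory
xml_dir <- tempfile("xml_demo")
dir.create(xml_dir)
writeLines("<doc><record><a>1</a><b>2</b></record></doc>",
           file.path(xml_dir, "one.xml"))
writeLines("<doc><record><a>10</a><c>bob</c></record></doc>",
           file.path(xml_dir, "two.xml"))

# Only pick up files ending in .xml, with their full paths
xmlfiles <- list.files(xml_dir, pattern = "\\.xml$", full.names = TRUE)

# Hypothetical helper: read one file, returning NULL if it cannot be flattened
read_one <- function(f) {
  tryCatch(
    xmlToDataFrame(f),
    error = function(e) {
      warning("Skipping ", f, ": ", conditionMessage(e))
      NULL
    }
  )
}

frames <- Filter(Negate(is.null), lapply(xmlfiles, read_one))

# rbind.fill() stacks data frames with differing columns, filling gaps with NA
dat <- rbind.fill(frames)
write.csv(dat, file.path(xml_dir, "combined.csv"), row.names = FALSE)
```

Files with nested or repeated elements will likely need their own extraction step (for example with getNodeSet() and xmlValue()) before they fit this shape.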