
As I am rather new to R, I am trying to learn how to extract two values from an XML file and then loop over the 5603 other (small, <2kb) XML files in my working directory.

I have been reading a lot of topics on 'looping', but find this rather confusing - especially because it seems that looping over XML files is different from looping over other files, correct?

I am using online data in XML structure.

For each XML file I want to write the "ZipCode" and "AwardAmount" to a table.

Running the following code I did retrieve the ZipCode and AwardAmount, but only from the very first file. How can I write a proper loop and write it to a table?

xmlfiles=list.files(pattern="*.xml")
for (i in 1:length(xmlfiles)){
    doc= xmlTreeParse("xmlfiles[i]", useInternal=TRUE)
    zipcode<-xmlValue(doc[["//ZipCode"]])
    amount<-xmlValue(doc[["//AwardAmount"]])
}

Does anyone have any suggestions?

wake_wake
    Well `"xmlfiles[i]"` is definitely not going to work. Try creating your file names using `paste(xmlfiles, seq_along(xmlfiles), sep = "")` – James King Apr 30 '14 at 00:36

2 Answers


This might work for you. I got rid of the for loop and went with sapply.

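library(XML)  # provides xmlTreeParse() and xmlValue()

# list.files() takes a regular expression, so anchor on the ".xml" extension
xmlfiles <- list.files(pattern = "\\.xml$")
# swap only the trailing extension, not every "xml" in the file name
txtfiles <- sub("\\.xml$", ".txt", xmlfiles)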

txtfiles is a set of new file names to be used as the output file for each run.

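# invisible() only suppresses printing of sapply's (unused) return value
invisible(sapply(seq_along(xmlfiles), function(i){

  # parse the i-th file into an internal (C-level) document
  doc <- xmlTreeParse(xmlfiles[i], useInternalNodes = TRUE)
  # [[ with an XPath expression returns the first matching node
  zipcode <- xmlValue(doc[["//ZipCode"]])
  amount <- xmlValue(doc[["//AwardAmount"]])
  DF <- data.frame(zip = zipcode, amount = amount)
  # one output file per input file
  write.table(DF, quote = FALSE, row.names = FALSE, file = txtfiles[i])

}))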

Please, let me know if there are issues when you run it.
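Since the question asked for a proper loop, here is a minimal sketch of the same extraction written as a for loop. The helper name extract_awards is hypothetical, and the sketch assumes every file contains exactly one ZipCode and one AwardAmount node; it uses xmlParse, which is the internal-node form of xmlTreeParse.

```r
library(XML)  # provides xmlParse() and xmlValue()

# Hypothetical helper: returns one row per XML file.
# Assumes each file has exactly one ZipCode and one AwardAmount node.
extract_awards <- function(files) {
  # Preallocate the result vectors instead of growing them in the loop.
  zip    <- character(length(files))
  amount <- character(length(files))

  for (i in seq_along(files)) {
    # Pass the file name itself, not the literal string "xmlfiles[i]".
    doc <- xmlParse(files[i])
    zip[i]    <- xmlValue(doc[["//ZipCode"]])
    amount[i] <- xmlValue(doc[["//AwardAmount"]])
  }

  data.frame(file = basename(files), zip = zip, amount = amount,
             stringsAsFactors = FALSE)
}

# Usage, assuming the XML files sit in the working directory:
# awards <- extract_awards(list.files(pattern = "\\.xml$"))
```

The key difference from the loop in the question is that the values are stored by index, so they are not overwritten on every iteration.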

Rich Scriven

This is only a slightly different approach from Richard's. It uses ldply (from the plyr package) to build a single data frame before writing it out to a file. You should accept his answer, since the "guts" of the ldply function are his, but this shows an alternate way of doing it (assuming you want one file instead of many files):

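library(XML)   # xmlTreeParse(), xmlValue()
library(plyr)  # ldply()

setwd("LOCATION_OF_XML_FILES")

# list.files() takes a regular expression, so anchor on the ".xml" extension
xmlfiles <- list.files(pattern = "\\.xml$")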

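# ldply() row-binds the one-row data frames returned for each file
dat <- ldply(seq_along(xmlfiles), function(i){

  doc <- xmlTreeParse(xmlfiles[i], useInternalNodes = TRUE)

  # [[ with an XPath expression returns the first matching node
  zipcode <- xmlValue(doc[["//ZipCode"]])
  amount <- xmlValue(doc[["//AwardAmount"]])

  return(data.frame(zip = zipcode, amount = amount))

})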

head(dat)
##         zip amount
## 1 442420001  45000
## 2 479072114 400580
## 3 303320420  22050
## 4 326112002  12000
## 5 265066845  37000
## 6 168027000 300000

write.csv(dat, "zipamount.csv", row.names=FALSE)

You could use append=TRUE with Richard's approach and use a single file name in that write.table to do the same thing. You can also tweak the output settings of write.csv (or write.table) to get the output format you eventually want to work with.
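As a rough sketch of that append = TRUE variant: write.table repeats the header line on every call unless col.names is switched off, so the header is written only when the file is first created. The helper append_row and the temporary output path are hypothetical stand-ins, and the two rows are taken from the head(dat) output above.

```r
# Hypothetical helper: append one row to a shared file, writing the header
# only when the file does not exist yet.
append_row <- function(row, file) {
  first <- !file.exists(file)
  write.table(row, file = file, sep = ",", quote = FALSE,
              row.names = FALSE, col.names = first, append = !first)
}

out <- tempfile(fileext = ".csv")  # stand-in for a real output path
append_row(data.frame(zip = "442420001", amount = "45000"), out)
append_row(data.frame(zip = "479072114", amount = "400580"), out)
readLines(out)
```

Inside the per-file function, the write.table call would be replaced by a call like append_row(DF, "zipamount.csv").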

You can also add recursive = TRUE to the list.files call to go through all the subdirectories instead of putting all ~5,600 files into one directory (which can cause performance issues on some filesystems/operating systems).
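A small sketch of the recursive listing, using a throwaway directory tree so it runs anywhere. The directory and file names are made up for illustration; full.names = TRUE is added so the returned paths stay valid regardless of the working directory.

```r
# Build a throwaway directory tree so the sketch is self-contained.
root <- tempfile("xmlroot")
dir.create(file.path(root, "batch1"), recursive = TRUE)
dir.create(file.path(root, "batch2"))
file.create(file.path(root, "batch1", "a.xml"),
            file.path(root, "batch2", "b.xml"),
            file.path(root, "batch2", "notes.txt"))

# recursive = TRUE descends into subdirectories; the regular expression
# keeps only names ending in ".xml".
xmlfiles <- list.files(root, pattern = "\\.xml$",
                       recursive = TRUE, full.names = TRUE)
basename(xmlfiles)
```

The resulting xmlfiles vector can be passed to the parsing code unchanged, since each element is a complete path.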

hrbrmstr
  • hrbrmstr, thanks a lot for your comments. Using your method I was able to construct one file. Your comments helped me understand much better what is going on. Thank you. – wake_wake May 01 '14 at 04:36