
I found this question and hrbrmstr's answer: "In R, how to extracting two values from XML file, looping over 5603 files and write to table" ... which works, for example, with the Crude dataset, but with my own dataset I get an error: Error in ans[[1]] : subscript out of bounds

library(XML)
library(plyr)

setwd("LOCATION_OF_XML_FILES")

xmlfiles <- list.files(pattern = "*.xml")

dat <- ldply(seq(xmlfiles), function(i){
  doc <- xmlTreeParse(xmlfiles[i], useInternal = TRUE)
  teksti <- xmlValue(doc[["//body"]])
  file <- unlist(strsplit(xmlfiles[i], split = ".", fixed = TRUE))[1]
  return(data.frame(file, teksti))
})

head(dat)

write.csv(dat, "tekstit_xml.csv", row.names = FALSE)

My dataset is confidential so I'm afraid I can't share it, but the structure is like this:

<?xml version="1.0" encoding="UTF-8"?>
<article>
  <body>
    flajslkfjlkjaslkjflkajlskjfasjdfjflkdsjalfjdsj
    "a lot of text, like a chapter of a book"
  </body>
</article>
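For a file that really has this structure, the lookup inside the loop behaves as expected. A minimal check, assuming the XML package is loaded and using an inline string instead of a file on disk:

library(XML)

doc <- xmlParse("<article><body>a chapter of a book</body></article>", asText = TRUE)
xmlValue(doc[["//body"]])
# [1] "a chapter of a book"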

If I remove the line teksti <- xmlValue(doc[["//body"]]), the code works, but with it included I get an error:

Error in ans[[1]] : subscript out of bounds

Can you please help me?

EDIT: I tried it with 11 files and everything went fine, but with the full set of 530 XML files it still gives the error. The largest files have about 5000 words in them. So does a data.frame have a limit on its size?

Traceback:

 Error in ans[[1]] : subscript out of bounds
 8 `[[.XMLInternalDocument`(doc, "//body")
 7 doc[["//body"]]
 6 xmlValue(doc[["//body"]])
 5 FUN(X[[12L]], ...)
 4 lapply(pieces, .fun, ...)
 3 structure(lapply(pieces, .fun, ...), dim = dim(pieces))
 2 llply(.data = .data, .fun = .fun, ..., .progress = .progress,
     .inform = .inform, .parallel = .parallel, .paropts = .paropts)
 1 ldply(seq(xmlfiles), function(i) {
     doc <- xmlTreeParse(xmlfiles[i], useInternal = TRUE)
     teksti <- xmlValue(doc[["//body"]])
     file <- unlist(strsplit(xmlfiles[i], split = ".", fixed = T))[1] ...
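The frame FUN(X[[12L]], ...) in the traceback suggests the function failed on the 12th element of the sequence passed to ldply, i.e. on xmlfiles[12]. A quick way to inspect that file, and to scan the whole set, is sketched below; it assumes the same working directory and xmlfiles vector as above.

library(XML)

# parse the file the traceback points to and count its <body> nodes
doc <- xmlParse(xmlfiles[12])
length(getNodeSet(doc, "//body"))   # 0 would mean no <body> node in this file

# or scan all files up front and list the ones with no <body> node
missing_body <- xmlfiles[sapply(xmlfiles, function(f) {
  length(getNodeSet(xmlParse(f), "//body")) == 0
})]
missing_body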
  • The main constraint on data frame size is that the entire `data.frame` must fit in the memory of the machine running R, along with everything else in the environment. The error R returns when memory runs out (I have hit it a few times) is definitely not a subscript out of bounds. – vpipkt Nov 18 '14 at 20:25
  • What does `traceback()` say? – vpipkt Nov 18 '14 at 20:27
  • I have it at my office, I'll get back on this in 9 hours... – ElinaJ Nov 18 '14 at 20:57

1 Answer


One of your files is missing the "body" tag. You can reproduce the error by asking for a node that does not exist (note the deliberately misspelled //bodyy):

xmlValue(doc[["//bodyy"]])
Error in ans[[1]] : subscript out of bounds

You can use xpathSApply instead, which returns an empty list when the tag is missing:

xpathSApply(doc, "//bodyy", xmlValue)
list()

and then add a check to your code so that files without a body still return a data.frame (with NA for the text):

dat <- ldply(seq(xmlfiles), function(i){
  doc <- xmlParse(xmlfiles[i])
  teksti <- xpathSApply(doc, "//body", xmlValue)
  if (length(teksti) == 0) {
    # no <body> node in this file: warn and keep an NA placeholder
    print(paste("Warning: no body tag in", xmlfiles[i], i))
    teksti <- NA
  }
  file <- unlist(strsplit(xmlfiles[i], split = ".", fixed = TRUE))[1]
  return(data.frame(file, teksti))
})
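Not part of the answer itself, but a quick follow-up sketch, assuming the loop above has run: inspect the result, count the files that came back without a body, and write the table as in the question.

head(dat)                   # first few rows: one row per file
sum(is.na(dat$teksti))      # number of files that had no <body> node
write.csv(dat, "tekstit_xml.csv", row.names = FALSE)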
  • Thanks! That is true: 36 of the 530 documents have no //body. But now I get a new error when running the "Warning" code: "Error in list_to_dataframe(res, attr(.data, "split_labels"), .id, id_as_factor) : Results must be all atomic, or all data frames". When I change ldply to lapply it works, but the data in the csv is in a peculiar format... Any ideas? – ElinaJ Nov 19 '14 at 06:32
  • With lapply, you should get a list of data.frames and you can combine those with do.call("rbind", dat). However, it looks like one or more files are not writing to a data.frame, so check the list elements with something like table(sapply(dat, nrow)) or table(sapply(dat, is.data.frame)) – Chris S. Nov 19 '14 at 07:02
  • I updated the loop in my answer so all files return a data.frame and maybe that will help? – Chris S. Nov 19 '14 at 07:14
  • Thanks so much! Now it's working! With do.call("rbind", dat) it was working with lapply as well! Now the only problem is how to separate the file and text columns when opening in Excel. The text has ",", ".", ";", "\t", so whatever I try the text is split for some entries... – ElinaJ Nov 19 '14 at 08:29
  • Sure. You can change the separator option in write.table to sep="|" and try loading that. Also, I would remove tabs and/or newlines from the text in the loop like teksti <- gsub("\n|\t", " ", teksti). With any loop over lots of files, you should do lots of checking and cleaning data anyway. – Chris S. Nov 19 '14 at 15:21
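Pulling the suggestions from these comments together, here is a sketch of the lapply variant with the rbind step, the whitespace clean-up, and a pipe-separated export. The output file name and the gsub pattern are illustrative, not from the thread.

library(XML)

res <- lapply(seq_along(xmlfiles), function(i) {
  doc    <- xmlParse(xmlfiles[i])
  teksti <- xpathSApply(doc, "//body", xmlValue)
  if (length(teksti) == 0) teksti <- NA
  # strip newlines and tabs so each text stays on one line in the export
  teksti <- gsub("\n|\t", " ", teksti)
  file   <- unlist(strsplit(xmlfiles[i], split = ".", fixed = TRUE))[1]
  data.frame(file, teksti, stringsAsFactors = FALSE)
})

# sanity checks suggested in the comments above
table(sapply(res, is.data.frame))   # every element should be a data.frame
table(sapply(res, nrow))            # every element should have one row

dat <- do.call("rbind", res)

# pipe-separated export so commas and semicolons in the text do not split columns
write.table(dat, "tekstit_xml.txt", sep = "|", row.names = FALSE)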