I found this question and hrbrmstr's answer: "In R, how to extracting two values from XML file, looping over 5603 files and write to table". The code works with, for example, the crude dataset, but with my own dataset I get an error: Error in ans[[1]] : subscript out of bounds
library(XML)   # xmlTreeParse, xmlValue
library(plyr)  # ldply

setwd("LOCATION_OF_XML_FILES")
xmlfiles <- list.files(pattern = "*.xml")
dat <- ldply(seq(xmlfiles), function(i){
  doc <- xmlTreeParse(xmlfiles[i], useInternal = TRUE)
  # text content of the first <body> node
  teksti <- xmlValue(doc[["//body"]])
  # file name without the extension
  file <- unlist(strsplit(xmlfiles[i],split=".",fixed=T))[1]
  return(data.frame(file,teksti))
})
head(dat)
write.csv(dat, "tekstit_xml.csv", row.names=FALSE)
My dataset is confidential so I'm afraid I can't share it, but the structure is like this:
<?xml version="1.0" encoding="UTF-8"?>
<article> <body> flajslkfjlkjaslkjflkajlskjfasjdfjflkdsjalfjdsj
"alot of text, like a chapter of a book"
</body> </article>
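To make the expected behaviour concrete, here is a minimal sketch of the extraction on a single document with this structure. The `sample_xml` string is a made-up stand-in for the confidential data, and `getNodeSet()` is used instead of `doc[["//body"]]` only so that the number of matches can be inspected first; this is an illustration, not the code from the linked answer.

```r
library(XML)

# Hypothetical stand-in for one confidential file (same structure, dummy text)
sample_xml <- '<?xml version="1.0" encoding="UTF-8"?>
<article> <body> some text, like a chapter of a book </body> </article>'

# Parse from a string rather than from a file
doc <- xmlParse(sample_xml, asText = TRUE)

# All nodes matching the XPath expression; an empty list if there is no match
body_nodes <- getNodeSet(doc, "//body")
n_matches <- length(body_nodes)

# Text content of the first match, with surrounding whitespace removed
teksti <- trimws(xmlValue(body_nodes[[1]]))
```

With a file like this, there should be exactly one match and `teksti` should hold the body text.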
If I remove the line "teksti <- xmlValue(doc[["//body"]])", the code works, but when it is included I get an error:
Error in ans[[1]] : subscript out of bounds
Can you please help me?
EDIT: I tried it with 11 files and everything went well, but with all 530 XML files it still gives the error. The largest files have about 5000 words in them. Does data.frame have a limit on its size?
Traceback:
Error in ans[[1]] : subscript out of bounds
8 `[[.XMLInternalDocument`(doc, "//body")
7 doc[["//body"]]
6 xmlValue(doc[["//body"]])
5 FUN(X[[12L]], ...)
4 lapply(pieces, .fun, ...)
3 structure(lapply(pieces, .fun, ...), dim = dim(pieces))
2 llply(.data = .data, .fun = .fun, ..., .progress = .progress,
.inform = .inform, .parallel = .parallel, .paropts = .paropts)
1 ldply(seq(xmlfiles), function(i) {
doc <- xmlTreeParse(xmlfiles[i], useInternal = TRUE)
teksti <- xmlValue(doc[["//body"]])
file <- unlist(strsplit(xmlfiles[i], split = ".", fixed = T))[1] ...
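The traceback shows the error being raised inside `[[.XMLInternalDocument` for the 12th element of the loop (`FUN(X[[12L]], ...)`), before `data.frame()` is reached. One way to narrow this down is to check whether every file actually yields a match for `//body`. The following is a rough sketch under that assumption; `has_body` is a hypothetical helper name introduced here, and files that fail to parse at all are simply reported as having no body.

```r
library(XML)

# Hypothetical helper: TRUE if the file parses and has at least one <body> node
has_body <- function(path) {
  doc <- tryCatch(xmlParse(path), error = function(e) NULL)
  if (is.null(doc)) return(FALSE)            # file could not be parsed
  length(getNodeSet(doc, "//body")) > 0      # empty list means no match
}

# Same file list as in the loop, with the pattern written as a regular expression
xmlfiles <- list.files(pattern = "\\.xml$")

# One logical per file; the files listed at the end are candidates for the error
ok <- vapply(xmlfiles, has_body, logical(1))
xmlfiles[!ok]
```

If this lists any files, inspecting them by hand (for example the 12th entry of `xmlfiles`) may show a different root structure or a missing `<body>` element.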