
I have been fighting with this for quite a long time and cannot get it to work, so I am posting here. I'm not an advanced R user, but I'm learning and slowly making progress. I have not found an example on Stack Overflow that I could adapt to this: the examples I found seem to have a different structure, with no need to look up a higher-level attribute for each node. Or that's how I understand the difference now. The question is similar to this, but the file structure is different. For now I basically used this example.

Let's say I have a large number of small XML files with the structure presented below. They have names like file1.xml, file2.xml and so on. So file1.xml would be:

<NODE>
<SUBNODE TYPE="WORDS" SPEAKER="person1">
<WORD>word1</WORD>
<WORD>word2</WORD>
<WORD>word3</WORD>
</SUBNODE>
<SUBNODE TYPE="WORDS" SPEAKER="person2">
<WORD>word4</WORD>
<WORD>word5</WORD>
<WORD>word6</WORD>
</SUBNODE>
</NODE>

And then file2.xml would be:

<NODE>
<SUBNODE TYPE="WORDS" SPEAKER="person3">
<WORD>word7</WORD>
<WORD>word8</WORD>
<WORD>word9</WORD>
</SUBNODE>
<SUBNODE TYPE="WORDS" SPEAKER="person4">
<WORD>word10</WORD>
<WORD>word11</WORD>
<WORD>word12</WORD>
</SUBNODE>
</NODE>

And I would like to turn these into a data frame like this:

Filename   Speaker   Word
file1      person1   word1
file1      person1   word2
file1      person1   word3
file1      person2   word4
file1      person2   word5
file1      person2   word6
file2      person3   word7
file2      person3   word8
file2      person3   word9
file2      person4   word10
file2      person4   word11
file2      person4   word12

I can get the listing of all the words into one data frame with this:

library(XML)
library(plyr)
xmlfiles <- list.files(pattern = "\\.xml$")  # pattern is a regular expression, not a glob
dat <- ldply(seq(xmlfiles), function(i){
    doc <- xmlTreeParse(xmlfiles[i], useInternalNodes = TRUE)
    Word <- xpathSApply(doc, "//SUBNODE[@TYPE='WORDS']/WORD", xmlValue)
    return(data.frame(Word))
})

The content of "dat" is now a word list, as it should be. But no matter what I try I cannot get other data added into it. I have tried to add things there like:

xmlfiles <- list.files(pattern = "*.xml")
dat <- ldply(seq(xmlfiles), function(i){
    doc <- xmlTreeParse(xmlfiles[i], useInternal = TRUE)
    Word <- xpathSApply(doc, "//SUBNODE[@TYPE='WORDS']/WORD", xmlValue)
    Speaker <- xpathSApply(doc, "//SUBNODE[@TYPE='WORDS']", xmlGetAttr, "SPEAKER")        
    return(data.frame(Word, Speaker))
})

But then the dataframe is not correct, as it doesn't associate the right speaker with the right word.

Word    Speaker
word1   person1
word2   person2
word3   person1
word4   person2
word5   person1
word6   person2
word7   person3
word8   person4
word9   person3
word10  person4
word11  person3
word12  person4
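
The alternating pattern in this table is what R's vector recycling produces: within one file there are six words but only two speakers, and data.frame() repeats the shorter vector to fill the rows. A minimal sketch, with hand-typed vectors standing in for the xpathSApply results (the names recycled and res are only for illustration):

```r
# Six words and two speakers, as extracted from file1.xml in the example
Word <- paste0("word", 1:6)
Speaker <- c("person1", "person2")

# 6 is a multiple of 2, so data.frame() silently recycles Speaker
recycled <- data.frame(Word, Speaker, stringsAsFactors = FALSE)
print(recycled)

# When the lengths do not divide evenly, data.frame() stops with an error instead
res <- try(data.frame(Word = Word[1:5], Speaker = Speaker), silent = TRUE)
inherits(res, "try-error")
```

So the rows are filled without any regard to which SUBNODE a word came from.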

Then I also frequently get errors like:

Error in UseMethod("xmlValue") : 
no applicable method for 'xmlValue' applied to an object of class "c('XMLInternalDocument', 'XMLAbstractDocument')"

Or I get an error saying that the vectors are of different lengths, which of course they are, as there are fewer speakers than words. I have tried many things, but have posted only my "most successful" approaches here. I understand that I need a function that matches each word with the SPEAKER attribute of its parent node; just extracting words and speakers into separate lists doesn't help. I guess it's just luck that in this example the number of words is a multiple of the number of speakers, so they could be combined into the data frame above at all.

I would also still need to get the filenames into one column, as they contain a piece of information that I don't have inside the XML files themselves. This is the least important aspect of my question, though. The actual files I work with are much more complex, which is why the example contains seemingly unnecessary structures such as SUBNODE TYPE.

Thank you for your help!

nikopartanen

2 Answers


I would try looping through the files and using getNodeSet to pull out each SUBNODE, then building one data frame per node. I don't use ldply very often, but you could probably replace the loop with that?

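library(XML)
xmlfiles <- list.files(pattern = "\\.xml$")  # pattern is a regular expression, not a glob
n <- length(xmlfiles)
dat <- vector("list", n)
for (i in seq_len(n)) {
  doc <- xmlParse(xmlfiles[i])
  nodes <- getNodeSet(doc, "//SUBNODE")
  # one data frame per SUBNODE; Filename and Speaker are repeated for each of its words
  x <- lapply(nodes, function(node) {
    data.frame(
      Filename = xmlfiles[i],
      Speaker = xmlGetAttr(node, "SPEAKER"),
      Word = xpathSApply(node, ".//WORD", xmlValue)
    )
  })
  dat[[i]] <- do.call("rbind", x)
}
# stack the per-file data frames into a single one
do.call("rbind", dat)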
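
As for replacing the loop, one possible sketch uses lapply over the file names and also drops the .xml extension so the Filename column matches the output asked for. The helper name read_words and the temporary sample file are assumptions made only so the example runs on its own; they are not part of the original code.

```r
library(XML)

# Hypothetical helper: returns one data frame (Filename, Speaker, Word) per file
read_words <- function(path) {
  doc <- xmlParse(path)
  nodes <- getNodeSet(doc, "//SUBNODE[@TYPE='WORDS']")
  per_node <- lapply(nodes, function(node) {
    data.frame(
      Filename = tools::file_path_sans_ext(basename(path)),  # "file1.xml" -> "file1"
      Speaker  = xmlGetAttr(node, "SPEAKER"),
      Word     = xpathSApply(node, "./WORD", xmlValue),
      stringsAsFactors = FALSE
    )
  })
  do.call("rbind", per_node)
}

# Write a small sample file to a temporary directory so the sketch is self-contained
demo_dir <- tempfile("xmldemo")
dir.create(demo_dir)
writeLines(c(
  '<NODE>',
  '<SUBNODE TYPE="WORDS" SPEAKER="person1">',
  '<WORD>word1</WORD><WORD>word2</WORD><WORD>word3</WORD>',
  '</SUBNODE>',
  '<SUBNODE TYPE="WORDS" SPEAKER="person2">',
  '<WORD>word4</WORD><WORD>word5</WORD><WORD>word6</WORD>',
  '</SUBNODE>',
  '</NODE>'
), file.path(demo_dir, "file1.xml"))

xmlfiles <- list.files(demo_dir, pattern = "\\.xml$", full.names = TRUE)
dat <- do.call("rbind", lapply(xmlfiles, read_words))
print(dat)
```

With real data, list.files would point at the directory holding the XML files instead of demo_dir.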
Chris S.
  • This works very well as well! Thank you! For some reason I get error with: "do.call("rbind", dat")" "Error in do.call("rbind", "dat") : second argument must be a list" Do you have any suggestions why this occurs? – nikopartanen Aug 15 '14 at 09:25
  • I should have checked what this do.call("rbind") thing is. It works like a charm with do.call("rbind", dat), and it was easy to apply this to my real data. Thanks! – nikopartanen Aug 15 '14 at 17:01
  • Sorry about that typo - it's fixed above – Chris S. Aug 19 '14 at 15:34

One possibility is to get all the relevant values with a single XPath query (xml is, I think, your doc)

x = xml['//SUBNODE/@SPEAKER | //SUBNODE/WORD/text()']

find the speakers and convert everything to a simple character vector

isSpeaker = sapply(x, is, "XMLAttributeValue")
x[!isSpeaker] = sapply(x[!isSpeaker], xmlValue)
x = unlist(x, use.names=FALSE)

and then munge the result

r = rle(isSpeaker)
data.frame(Speaker=rep(x[isSpeaker], r$lengths[!r$values]), Word=x[!isSpeaker])

(I don't think this is robust to speakers without words, but what kind of speaker would that be?)
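
A different way to pair each word with its speaker, offered only as a sketch and not as part of the approach above, is to select the WORD nodes and read the SPEAKER attribute from each node's parent with xmlParent. The inline XML string and the names txt, words and dat are assumptions for illustration.

```r
library(XML)

# Small inline document with the same structure as the question's files
txt <- '<NODE>
<SUBNODE TYPE="WORDS" SPEAKER="person1"><WORD>word1</WORD><WORD>word2</WORD></SUBNODE>
<SUBNODE TYPE="WORDS" SPEAKER="person2"><WORD>word3</WORD></SUBNODE>
</NODE>'
doc <- xmlParse(txt, asText = TRUE)

# Select every WORD node, then look upward for the speaker of each one
words <- getNodeSet(doc, "//SUBNODE[@TYPE='WORDS']/WORD")
dat <- data.frame(
  Speaker = sapply(words, function(w) xmlGetAttr(xmlParent(w), "SPEAKER")),
  Word    = sapply(words, xmlValue),
  stringsAsFactors = FALSE
)
print(dat)
```

Because every row starts from a WORD node, a SUBNODE without any words simply contributes no rows.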

Martin Morgan