I have been fighting with this for quite a long time and cannot get it to work, so I am posting here. I'm not an advanced R user, but I'm learning and slowly making progress. I have not found an example on Stack Overflow that I could adapt to this; the examples I found seem to have a different structure, with no need to look up a higher-level attribute for each node. Or that's how I understand the difference now. The question is similar to this, but the file structure is different. For now I basically used this example.
Let's say I have a large number of small XML files with the structure presented below. They have names like file1.xml, file2.xml and so on. So file1.xml would be:
<NODE>
<SUBNODE TYPE="WORDS" SPEAKER="person1">
<WORD>word1</WORD>
<WORD>word2</WORD>
<WORD>word3</WORD>
</SUBNODE>
<SUBNODE TYPE="WORDS" SPEAKER="person2">
<WORD>word4</WORD>
<WORD>word5</WORD>
<WORD>word6</WORD>
</SUBNODE>
</NODE>
And then file2.xml would be:
<NODE>
<SUBNODE TYPE="WORDS" SPEAKER="person3">
<WORD>word7</WORD>
<WORD>word8</WORD>
<WORD>word9</WORD>
</SUBNODE>
<SUBNODE TYPE="WORDS" SPEAKER="person4">
<WORD>word10</WORD>
<WORD>word11</WORD>
<WORD>word12</WORD>
</SUBNODE>
</NODE>
And I would like to turn these into a data frame like this:
Filename Speaker Word
file1 person1 word1
file1 person1 word2
file1 person1 word3
file1 person2 word4
file1 person2 word5
file1 person2 word6
file2 person3 word7
file2 person3 word8
file2 person3 word9
file2 person4 word10
file2 person4 word11
file2 person4 word12
I can get the listing of all the words into one data frame with this:
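library(XML)
library(plyr)
# list.files() takes a regular expression, not a shell glob
xmlfiles <- list.files(pattern = "\\.xml$")
dat <- ldply(seq_along(xmlfiles), function(i){
  # parse into an internal document so that XPath queries can be used
  doc <- xmlTreeParse(xmlfiles[i], useInternalNodes = TRUE)
  # one value per WORD element
  Word <- xpathSApply(doc, "//SUBNODE[@TYPE='WORDS']/WORD", xmlValue)
  return(data.frame(Word))
})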
The content of "dat" is now a word list, as it should be. But no matter what I try I cannot get other data added into it. I have tried to add things there like:
xmlfiles <- list.files(pattern = "\\.xml$")
dat <- ldply(seq_along(xmlfiles), function(i){
  doc <- xmlTreeParse(xmlfiles[i], useInternalNodes = TRUE)
  # one value per WORD element
  Word <- xpathSApply(doc, "//SUBNODE[@TYPE='WORDS']/WORD", xmlValue)
  # one value per SUBNODE element, so fewer values than in Word
  Speaker <- xpathSApply(doc, "//SUBNODE[@TYPE='WORDS']", xmlGetAttr, "SPEAKER")
  return(data.frame(Word, Speaker))
})
But then the data frame is not correct, as it doesn't associate the right speaker with the right word:
Word Speaker
word1 person1
word2 person2
word3 person1
word4 person2
word5 person1
word6 person2
word7 person3
word8 person4
word9 person3
word10 person4
word11 person3
word12 person4
Then I also frequently get errors like:
Error in UseMethod("xmlValue") :
no applicable method for 'xmlValue' applied to an object of class "c('XMLInternalDocument', 'XMLAbstractDocument')"
Or I get an error saying that the vectors have different lengths, which of course they do, as there are fewer speakers than words. I have tried many things, but I posted only my "most successful" approaches here. I understand that I need a function that matches each word with the SPEAKER attribute of its parent node; extracting the words and the speakers into separate vectors doesn't help. I guess it's only luck that in this example the number of words is a multiple of the number of speakers, so R recycled the speakers to produce the data frame above.
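To make the kind of matching I mean concrete, here is a minimal sketch that looks up the parent of each WORD node instead of collecting speakers separately. It assumes that xmlParent() from the XML package can be called on the nodes returned by getNodeSet(), and it uses an inline copy of file1.xml so that it runs on its own. I have not checked it against my real, more complex files.

```r
library(XML)

# Inline copy of the file1.xml example so the sketch is self-contained.
xml_text <- '<NODE>
<SUBNODE TYPE="WORDS" SPEAKER="person1">
<WORD>word1</WORD><WORD>word2</WORD><WORD>word3</WORD>
</SUBNODE>
<SUBNODE TYPE="WORDS" SPEAKER="person2">
<WORD>word4</WORD><WORD>word5</WORD><WORD>word6</WORD>
</SUBNODE>
</NODE>'
doc <- xmlParse(xml_text, asText = TRUE)

# Select the WORD nodes themselves, so both columns are built from the
# same node list and always have the same length.
words <- getNodeSet(doc, "//SUBNODE[@TYPE='WORDS']/WORD")

dat <- data.frame(
  # For each WORD, read the SPEAKER attribute of its parent SUBNODE.
  Speaker = as.character(sapply(words, function(w) xmlGetAttr(xmlParent(w), "SPEAKER"))),
  # The text content of the WORD node.
  Word = as.character(sapply(words, xmlValue)),
  stringsAsFactors = FALSE
)
print(dat)
```

Because each row starts from a WORD node, nothing gets recycled: a speaker with one word and a speaker with ten words would both come out with the right number of rows.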
I would also still need to get the filenames into one column, as they carry information that is not inside the XML files themselves. This is the least important part of my question, though. The actual files I work with are much more complex, which is why the example contains seemingly unnecessary structure such as the SUBNODE TYPE attribute.
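For the filename column, something along these lines might work. This is only a sketch under assumptions: it writes two trimmed-down files into a temporary directory so that it is runnable, make_xml() is a hypothetical helper that exists only to create that test data, and it uses base R (lapply plus rbind) rather than plyr.

```r
library(XML)

# Hypothetical helper, only used to create small test files for this sketch.
make_xml <- function(speaker, words) {
  c("<NODE>",
    sprintf('<SUBNODE TYPE="WORDS" SPEAKER="%s">', speaker),
    sprintf("<WORD>%s</WORD>", words),
    "</SUBNODE>",
    "</NODE>")
}

# Write two trimmed-down example files into a fresh temporary directory.
demo_dir <- tempfile("xml_demo")
dir.create(demo_dir)
writeLines(make_xml("person1", c("word1", "word2")), file.path(demo_dir, "file1.xml"))
writeLines(make_xml("person3", c("word7", "word8")), file.path(demo_dir, "file2.xml"))

# The pattern is a regular expression; full.names keeps the directory part.
xmlfiles <- list.files(demo_dir, pattern = "\\.xml$", full.names = TRUE)

dat <- do.call(rbind, lapply(xmlfiles, function(f) {
  doc <- xmlParse(f)
  words <- as.character(xpathSApply(doc, "//SUBNODE[@TYPE='WORDS']/WORD", xmlValue))
  data.frame(
    # Strip the directory and the .xml extension, then repeat once per word.
    Filename = rep(tools::file_path_sans_ext(basename(f)), length(words)),
    Word = words,
    stringsAsFactors = FALSE
  )
}))
print(dat)
```

The point is that the filename is repeated once per extracted word inside the per-file function, so it cannot get out of step with the rows that came from that file.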
Thank you for your help!