I need to find and combine information in some huge XML files. Parsing a whole file with doc <- xmlInternalTreeParse(file.name, useInternalNodes=TRUE, trim=TRUE) makes my 16 GB computer start swapping to disk before it finishes. I have followed the good instructions at http://www.omegahat.org/RSXML/Overview.html.

Adding to the example from there, this is more or less what my file looks like:

<?xml version="1.0" ?>
<TABLE>
  <SCHOOL>
    <NAME> School1 </NAME>
    <GRADES>
      <STUDENT> Fred </STUDENT>
      <TEST1> 66 </TEST1>
      <TEST2> 80 </TEST2>
      <FINAL> 70 </FINAL>
    </GRADES>
    <TEAMS>
      <SOCCER> SoccerTeam1 </SOCCER>
      <HOCKEY> HockeyTeam1 </HOCKEY>
    </TEAMS>
  </SCHOOL>
  <SCHOOL>
    <NAME> School2 </NAME>
    <GRADES>
      <STUDENT> Wilma </STUDENT>
      <TEST1> 97 </TEST1>
      <TEST2> 91 </TEST2>
      <FINAL> 98 </FINAL>
    </GRADES>
    <TEAMS>
      <SOCCER> SoccerTeam2 </SOCCER>
    </TEAMS>
  </SCHOOL>
</TABLE>

For each school that has a hockey team, I need to list the students and the team name. The wanted output for the example is "Fred", "HockeyTeam1", "School1". The real files have thousands of schools, hockey teams and students.

How can I use xmlEventParse to parse the files and extract this information? I tried to extract all text fields from the files, but after hours of waiting there was still no output. Note: the real files are more deeply nested than this, so it is not enough to step down a fixed number of levels to find the information.

Chris
  • The XML package has flexible [event parsing capabilities](http://stackoverflow.com/questions/16676798/storing-xml-node-values-with-rs-xmleventparse-for-filtered-output/16681768#16681768), [2](http://stackoverflow.com/questions/20719555/random-sampling-from-xml-file-into-data-frame-in-r/20732228#20732228), [3](http://stackoverflow.com/questions/7536754/storing-specific-xml-node-values-with-rs-xmleventparse/7547433#7547433) to iterate through large files – Martin Morgan Mar 25 '14 at 19:57
  • Hi, yes I looked at those questions and tried to apply them here, but my XML-skills are too poor to make the connection. – Chris Mar 25 '14 at 20:01

2 Answers

We'll use the XML package

library(XML)

and create a closure that contains a function to handle the 'SCHOOL' node, as well as two helper functions to retrieve results when done. The SCHOOL function is invoked on each SCHOOL node. If it finds a hockey team, it uses the /SCHOOL/NAME/text() as a 'key', and the /SCHOOL/TEAMS/HOCKEY/text() and //STUDENT/text() (or /SCHOOL/GRADES/STUDENT/text()) as values. A message is printed for every 1000 (by default) schools with hockey teams, so that there's some indication of progress. After parsing, the 'get' function retrieves the result as a data.frame, and 'getres' returns the underlying environment.

teams <- function(progress=1000) {
    res <- new.env(parent=emptyenv())   # for results
    it <- 0L                            # iterator -- nodes visited
    list(SCHOOL=function(elt) {
        ## handle 'SCHOOL' nodes 
        if (getNodeSet(elt, "not(/SCHOOL/TEAMS/HOCKEY)"))
            ## early exit -- no hockey team
            return(NULL)
        it <<- it + 1L
        if (it %% progress == 0L)
            message(it)
        school <- getNodeSet(elt, "string(/SCHOOL/NAME/text())") # 'key'
        res[[school]] <-
            list(team=getNodeSet(elt,
                   "normalize-space(/SCHOOL/TEAMS/HOCKEY/text())"),
                 students= xpathSApply(elt, "//STUDENT", xmlValue))
    }, getres = function() {
        ## retrieve the 'res' environment when done
        res
    }, get=function() {
        ## retrieve 'res' environment as data.frame
        school <- ls(res)
        team <- unlist(eapply(res, "[[", "team"), use.names=FALSE)
        student <- eapply(res, "[[", "students")
        len <- sapply(student, length)
        data.frame(school=rep(school, len), team=rep(team, len),
                   student=unlist(student, use.names=FALSE))
    })
}

We use the function as

branches <- teams()
xmlEventParse("event.xml", handlers=NULL, branches=branches)
branches$get()
Martin Morgan
I think you can work with an ordinary data frame. For example:

library(XML)
f <- xmlParse('file.xml')
df <- xmlToDataFrame(f)

You then have a data frame that you can filter with ordinary conditions. Or, if you would rather work with the XML tree, its attributes and values:

r <- xmlRoot(f)

Printing r, or any branch of it, shows the corresponding node; for example, r[[1]][[1]] gives <NAME> School1 </NAME>.
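
As a minimal sketch of the tree-based route, the following applies XPath to the small sample from the question, held in memory. It assumes the document fits in memory, which the question says is not true of the real files, so it only illustrates the queries. The variable names (xml_text, schools, rows, result) are made up for this example, and the sample is shortened to the elements that matter here.

```r
library(XML)

# Shortened copy of the sample from the question, for illustration only.
xml_text <- '<?xml version="1.0" ?>
<TABLE>
  <SCHOOL>
    <NAME> School1 </NAME>
    <GRADES>
      <STUDENT> Fred </STUDENT>
    </GRADES>
    <TEAMS>
      <SOCCER> SoccerTeam1 </SOCCER>
      <HOCKEY> HockeyTeam1 </HOCKEY>
    </TEAMS>
  </SCHOOL>
  <SCHOOL>
    <NAME> School2 </NAME>
    <GRADES>
      <STUDENT> Wilma </STUDENT>
    </GRADES>
    <TEAMS>
      <SOCCER> SoccerTeam2 </SOCCER>
    </TEAMS>
  </SCHOOL>
</TABLE>'

doc <- xmlParse(xml_text, asText = TRUE)

# Keep only SCHOOL nodes with a HOCKEY descendant at any depth.
schools <- getNodeSet(doc, "//SCHOOL[.//HOCKEY]")

rows <- lapply(schools, function(school) {
    # A leading '.' keeps each query inside the current SCHOOL node.
    # Assumes every selected school has at least one STUDENT.
    students <- trimws(xpathSApply(school, ".//STUDENT", xmlValue))
    data.frame(
        student = students,
        team    = trimws(xpathSApply(school, ".//HOCKEY", xmlValue))[1],
        school  = trimws(xpathSApply(school, "./NAME", xmlValue))[1],
        stringsAsFactors = FALSE
    )
})

# One row per student at a school that has a hockey team.
result <- do.call(rbind, rows)
print(result)
```

For the sample this gives a single row: Fred, HockeyTeam1, School1. The descendant axis (.//) is used instead of fixed paths because the question notes that the real files are nested more deeply than the sample.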

C Doan