0

I am trying to transform an XML file into a dataframe.

Example xml file:

<games id="32134">
    <game id="3962920" xsid="0">
    <time>2016-11-26T15:30:00+00:00</time>
    <group id="33765">Roses</group>
    <hteam id="2228">BlackSavers</hteam>
    <ateam id="226150">Regeton</ateam>
    <results>
    </results>
    <server sid="126" name="reg">
        <offer id="548331136">
            <states i="0" time="2016-11-26T10:03:56+00:00" starting_time="2016-11-26T15:30:00+00:00">
                <s1>2.750</s1>
                <s2>3.600</s2>
                <s3>2.100</s3>
            </states>
            <states i="1" time="2016-11-25T17:05:07+00:00" starting_time="2016-11-26T15:30:00+00:00">
                <s1>3.000</s1>
                <s2>3.600</s2>
                <s3>2.000</s3>
            </states>
        </offer>
    </server>
    <server bid="221" name="razor">
        <offer id="548415893">
            <states i="0" time="2016-11-26T10:11:26+00:00" starting_time="2016-11-26T15:30:00+00:00">
                <s1>653.000</s1>
                <s2>873.600</s2>
                <s3>225.100</s3>
            </states>
            <states i="1" time="2016-11-26T10:07:39+00:00" starting_time="2016-11-26T15:30:00+00:00">
                <s1>323.000</s1>
                <s2>321.750</s2>
                <s3>211.050</s3>
            </states>
            <states i="2" time="2016-11-25T19:54:20+00:00" starting_time="2016-11-26T15:30:00+00:00">
                <s1>223.100</s1>
                <s2>322.600</s2>
                <s3>232.050</s3>
            </states>
        </offer>
    </server>
    <server bid="291" name="nagie">
        <offer id="548454059">
            <states i="0" time="2016-11-26T13:21:08+00:00" starting_time="2016-11-26T15:30:00+00:00">
                <s1>323.000</s1>
                <s2>123.400</s2>
                <s3>342.100</s3>
            </states>
            <states i="1" time="2016-11-26T10:07:02+00:00" starting_time="2016-11-26T15:30:00+00:00">
                <s1>123.000</s1>
                <s2>323.500</s2>
                <s3>342.050</s3>
            </states>
            <states i="2" time="2016-11-25T21:35:50+00:00" starting_time="2016-11-26T15:30:00+00:00">
                <s1>374.000</s1>
                <s2>349.600</s2>
                <s3>200.000</s3>
            </states>
        </offer>
    </server>
</game>
</games>

Current code:

df <- do.call("rbind", xpathApply(doc, "//game", function(m) {
data.frame(
game_id = xmlAttrs(m)["id"],
t(xpathSApply(m, "group", function(g) {
  c(
    group_id = xmlAttrs(g)["id"],
    group = xmlValue(g[["group"]])
  )
})),
t(xpathSApply(m, "server",function(b){
  sid <- xmlAttrs(b)[["sid"]]
  name <- xmlAttrs(b)[["name"]]
  xpathSApply(b, "offer",function(of){
    c(
      sid = sid,
      name = name,
      id = xmlAttrs(of)[["id"]],
      do.call(cbind, xpathApply(of, "states",function(o){
        c(s1 <- xmlValue(o[["s1"]]),
          s2 <- xmlValue(o[["s2"]]),
          s3 <- xmlValue(o[["s3"]])
        )
      }))
      )})

  })))

}))

Desired dataframe output:

Desired format

My problem is, I can't figure out how to place states in the dataframe as well. The other levels are already in, and they do work. I would only need help for the last piece.

These posts helped me a lot xml with nested siblings to data frame in R Transforming data from xml into R dataframe

Thank you!

Community
  • 1
  • 1
ponthu
  • 311
  • 1
  • 3
  • 14

2 Answers2

1

One approach might be to extract the values without worrying about the 'geometry' of the document

game_id <- as.integer(xpathSApply(doc, "//game", xmlGetAttr, "id"))
server_id <- as.integer(xpathSApply(doc, "//server", xmlGetAttr, "bid"))
offer_id <- as.integer(xpathSApply(doc, "//offer", xmlGetAttr, "id"))
s1 <- as.numeric(xpathSApply(doc, "//s1", xmlValue))

then extract the geometry of nested replicated nodes

geo <- function(elt, node) length(getNodeSet(elt, node))
offer_geo <- sapply(getNodeSet(doc, "//offer"), geo, "states")

and put things together by summing or taking the product of nested geometries

data.frame(
    game_id = rep(game_id, sum(offer_geo)),
    server_id = rep(server_id, offer_geo),
    offer_id = rep(offer_id, offer_geo),
    s1=s1)
Martin Morgan
  • 45,935
  • 7
  • 84
  • 112
0

You can follow (2) in the answer here: Transforming data from xml into R dataframe . The idea is to search for the deepest node, here states and then compute the ancestors using xmlParent. From that point on it is routine. For example, using just a few of the fields (you can add the rest):

library(XML)
doc <- xpathTreeParse("games.xml", useInternalNodes = TRUE)

do.call("rbind", xpathApply(doc, "//states", function(states) {
   offer <- xmlParent(states)
   server <- xmlParent(offer)
   game <- xmlParent(server)
   games <- xmlParent(game)
   data.frame(
     gamesId = xmlAttrs(games)[["id"]],
     gameId = xmlAttrs(game)[["id"]],
     groupid = xmlAttrs(game[["group"]])[["id"]],
     groupname = xmlValue(game[["group"]]),
     offerId = xmlAttrs(offer)[["id"]],
     states_i = as.numeric(xmlAttrs(states)[["i"]]),
     s1 = as.numeric(xmlValue(states[["s1"]])),
     s2 = as.numeric(xmlValue(states[["s2"]])),
     stringsAsFactors = FALSE)
}))

giving:

  gamesId  gameId groupid groupname   offerId states_i     s1     s2
1   32134 3962920   33765     Roses 548331136        0   2.75   3.60
2   32134 3962920   33765     Roses 548331136        1   3.00   3.60
3   32134 3962920   33765     Roses 548415893        0 653.00 873.60
4   32134 3962920   33765     Roses 548415893        1 323.00 321.75
5   32134 3962920   33765     Roses 548415893        2 223.10 322.60
6   32134 3962920   33765     Roses 548454059        0 323.00 123.40
7   32134 3962920   33765     Roses 548454059        1 123.00 323.50
8   32134 3962920   33765     Roses 548454059        2 374.00 349.60
Community
  • 1
  • 1
G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341
  • That works thank you. Although I have to read in more than 100 xml files, and the performance does matter. As you pointed out in your post, the (1) solution runs faster. Would you mind adding that solution as well in your post? Thank you – ponthu Dec 19 '16 at 14:03
  • That's definitely more work but if you are willing to do it then it's just a matter of starting at the top and running nested `xpathApply` as shown there. I would first try the solution based on (2) and assess whether the performance will be sufficient since your time is important too. – G. Grothendieck Dec 19 '16 at 14:10