parsing XML files in R: Complex Structure

Question

I need to parse an XML file to a simple data frame. The issue is that the file is complex. Multiple authors can be associated with one patent, patents can take more than one tag, and author-institution associations are not necessarily one-to-one. What more? Names are split into two fields - given name and surname. I need to retrieve those without getting names mixed up across authors.

Unlike the problem described here (How to transform XML data into a data.frame?), the associations are 3-way: authors/inventors to patents, patents to tags, authors to institutions. Second, the institutions are not always reported in the same way. If all authors are affiliated with one institution, the institution is only shown once in the "aff" tag. (See example below). The usual problem of multiple authors per item remain.

The challenge is also similar to that poised here,(Parsing XML file with known structure and repeating elements), but as you can see, the layouts are quite different, to put it gently.

Madeup Data Sample

<patents>
      <patent patno="101103062330">
        <office coden="EPO"  short="Eur. Pat. Office"> European Office    </office>
        <volume>80    </volume>
        <issue printdate="2009-12-00">6    </issue>
        <numpages>13    </numpages>
        <section code="A-2D"> Filtering    </section>
        <patno>101103062330    </patno>
        <title> trapping plastic waste    </title>
        <authgrp>
          <author>
            <givenname>Endo    </givenname>
            <surname>Wake    </surname>
          </author>
          <author>
            <givenname>C.    </givenname>
            <surname>Morde    </surname>
          </author>
          <aff> University of M, USA    </aff>
        </authgrp>
        <history>
          <received date="2009-07-01"/>
          <published date="2009-07-30"/>
        </history>
        <tag tagyr="2009">
          <tagcode>B1.C2.B5    </tagcode>
          <tagcode>F4.65.F6    </tagcode>
        </tag>
        <assignment>
          <assigndate date="2009"/>
          <rightholder> university of M    </rightholder>
        </assignment>
  </patent>
      <patent patno="101103062514">
        <office coden="EPO"  short="Eur. Pat. Office"> European Office    </office>
        <issue printdate="2009-12-00">6    </issue>
        <numpages>15    </numpages>
        <section code="A-3D"> structure and dynamics    </section>
        <patno>101103062514    </patno>
        <title> separation of cascades and photon emission    </title>
        <authgrp>
          <author affref="a1 a2">
            <givenname>L.    </givenname>
            <surname>Slabsky    </surname>
          </author>
          <author affref="a1">
            <givenname>D.    </givenname>
            <surname>Volosvyev    </surname>
          </author>
          <author affref="a3">
            <givenname>G.    </givenname>
            <surname>Nonpl    </surname>
          </author>
          <aff affid="a1"> Institute of Physics,Russia    </aff>
          <aff affid="a2"> Physics Institute, St. Petersburg     </aff>
          <aff affid="a3">Technische Universiteit, Dresden    </aff>
        </authgrp>
        <history>
          <received date="2009-01-11"/>
          <published date="2009-01-31"/>
        </history>
        <tag tagyr="2009">
          <tagcode>A1.B2.C3    </tagcode>
        </tag>
        <assignment>
          <assigndate date="2009"/>
          <rightholder> Physics Inst    </rightholder>
        </assignment>
  </patent>
</patents>

I would like to get three tables from this xml file.

The first simply matches authors/inventors to their patents, the second matches patents to tags, while the third matches inventors/authors to institutions:

Table 1 Sample

`Patent Author1 Author2 Author3
101103062330 Endo Wake  C. Morde 
101103062514 L. Slabsky D.Volosyev  G. Nonpl`

The long table format is fine too.

    `Patent Author
    101103062330 Endo Wake
    101103062330 C. Morde
    101103062514  L. Slabsky
    101103062514  D.Volosyev
    101103062514  G. Nonpl`

Table 2 Sample

    `Patent Tag
    101103062330 B1.C2.B5
    101103062330 F4.65.F6
    101103062514 A1.B2.C3`

Table 3 Sample

   `Author  Institution
    Endo Wake   University of M
    C. Morde        University of M
    L. Slabsky      Institute of Physics,Russia
    D.Volosyev  Physics Institute, St. Petersburg
    G. Nonpl        Technische Universiteit, Dresden`

I tried using:

xmlfile     <- xmlInternalTreeParse("filename.xml", useInternal = T)
nodes     <- getNodeSet(xmlfile, "//patent")
authors     <- lapply(nodes, xpathSApply, ".//author", xmlValue)
patent     <- sapply(nodes, xpathSApply, ".//patent", xmlValue)

No luck. It would not resolve author names inside the groups.

I also tried:

dt1 <- ldply(xmlToList(xmlfile), data.table)

No luck. I got a table with column 1 saying "patent" and an assortment of data in column 2.

I am new to the XML package, so I would appreciate some support.

and the code you've tried so far is… – hrbrmstr Aug 26 '15 at 15:12 — hrbrmstr, Aug 26 '15 at 15:12
@hrbmstr see updates to question – user2627717 Aug 26 '15 at 15:26 — user2627717, Aug 26 '15 at 15:26

score 1 · Accepted Answer · answered Sep 02 '15 at 12:48

Try this.

Table 1

lapply(
  getNodeSet(patents, "//patent"),
  function(patent){
    data.frame(
      patent = xmlAttrs( patent )[["patno"]],
      xmlToDataFrame(
        nodes = getNodeSet(patent,".//*[contains(local-name(), 'author')]")
      ),
      stringsAsFactors = FALSE
    )
  }
)

Table 2

lapply(
  getNodeSet(patents, "//patent"),
  function(patent){
    data.frame(
      patent = xmlAttrs( patent )[["patno"]],
      tag = xpathSApply(
        patent,
        ".//tagcode",
        xmlValue
      ),
      stringsAsFactors = FALSE
    )
  }
)

Table 3

I'll leave the join up to you as homework.

lapply(
  getNodeSet(patents, "//authgrp"),
  function(autg){
    aff_df <- do.call(
      rbind,
      c(
        xpathApply(
          autg,
          ".//aff[@affid]",  # get only those with affid attr
          function(aff){
            data.frame(
              aff_id = xmlAttrs(aff)[["affid"]],
              institution = xmlValue(aff)
            )
          }
        ),
        xpathApply(
          autg,
          ".//aff[not(@affid)]",  # get only those without affid
          function(aff){
            data.frame(
              aff_id = NA,
              institution = xmlValue(aff)
            )
          }
        )
      )
    )

    authors <- getNodeSet( autg, "./author")
    aut_df <- xmlToDataFrame( nodes = authors )
    aut_df$aff_id <- lapply(
      1:length(authors)
      ,function(i){
        if(!is.null(xmlAttrs(authors[[i]])[["affref"]])){
          xmlAttrs(authors[[i]])[["affref"]]
        } else {
          NA
        }
      }
    )

    list(aff_df,aut_df)
  }
)

parsing XML files in R: Complex Structure

1 Answers1

Table 1

Table 2

Table 3