3

I have been trying to get an XML file into a dataframe but am struggling, I have tried a few approaches and this is where I am at.

My XML file looks like 20k segments of this:

<?xml version="1.0"?>
<data experimentId="5244" savingTime="2018-01-06T14:25:48-0500" eventType="Workflow" userId="303">
  <root>
    <set id="ASSAY_WORKFLOW">
      <row state="MODIFIED" pk="5905_Standard_Validation_Standard_Validation">
        <field name="ASSAY_ID">5244</field>
        <field name="WORKFLOW_ID">5905_Standard_Validation_Standard_Validation</field>
        <field name="WORKFLOW_STATE">0</field>
        <field name="ASSAY_WORKFLOW_STATE">InDelegation</field>
        <field name="WORKFLOW_LAST_STEP_ID">17896</field>
      </row>
    </set>
    <set id="WORKFLOW_STEPS">
      <row state="NEW" pk="17896">
        <field name="STEP_ID">17896</field>
        <field name="WORKFLOW_ID">5905_Standard_Validation_Standard_Validation</field>
        <field name="STEP_DATE">2018-01-06T14:25:45-0500</field>
        <field name="STEP_DATE_TZ">America/New_York</field>
        <field name="USER_ID">303</field>
        <field name="USER_FULL_NAME">Ron Swanson</field>
        <field name="NEW_WORKFLOW_ASSAY_STATE">InDelegation</field>
        <field name="FORMER_WORKFLOW_ASSAY_STATE">Draft</field>
        <field name="ROLE_ID">1</field>
      </row>
    </set>
    <set id="WORKFLOW_STEP_VARIABLES">
      <row state="NEW" pk="17896&#xA4;nextActorId">
        <field name="STEP_ID">17896</field>
        <field name="VARIABLE_ID">nextActorId</field>
        <field name="VALUE">2</field>
      </row>
      <row state="NEW" pk="17896&#xA4;validateToPendingValidation">
        <field name="STEP_ID">17896</field>
        <field name="VARIABLE_ID">validateToPendingValidation</field>
        <field name="VALUE">false</field>
      </row>
      <row state="NEW" pk="17896&#xA4;signToPendingSignature">
        <field name="STEP_ID">17896</field>
        <field name="VARIABLE_ID">signToPendingSignature</field>
        <field name="VALUE">false</field>
      </row>
      <row state="NEW" pk="17896&#xA4;comment">
        <field name="STEP_ID">17896</field>
        <field name="VARIABLE_ID">comment</field>
        <field name="VALUE">GH-VAP, IgG1 repeats,</field>
      </row>
      <row state="NEW" pk="17896&#xA4;actionDelegateU">
        <field name="STEP_ID">17896</field>
        <field name="VARIABLE_ID">actionDelegateU</field>
        <field name="VALUE">directDelegateU</field>
      </row>
    </set>
    <set id="WORKFLOW_ROLE_NAMES">
      <row state="NEW" pk="1">
        <field name="ROLE_ID">1</field>
        <field name="LANGUAGE_ID">2</field>
        <field name="DESCRIPTION">Author</field>
      </row>
    </set>
  </root>
</data>

For each root node there are child elements that have the same tag "field" with attribute "name". The values of which identify the value and name of the column I want it under in my data frame.

I can get everything out with this:

library(XML)

xmlfilealt <- xmlParse("data/eln_audit_workflow.xml")

username <- xpathSApply(xmlfilealt, "//field[@name='USER_FULL_NAME']", xmlValue)
title <- xpathSApply(xmlfilealt, "//field[@name='VALUE']", xmlValue)
state <- xpathSApply(xmlfilealt, "//field[@name='ASSAY_WORKFLOW_STATE']", xmlValue)
actionDate <- xpathSApply(xmlfilealt, "//field[@name='STEP_DATE']", xmlValue)
actor <- xpathSApply(xmlfilealt, "//field[@name='DESCRIPTION']", xmlValue)

I planned to create a data.frame with them but the vectors are all slighty different lengths which I assume is because there are probably some missing elements in some of the root nodes. Can someone clue me in on how to handle this?

Thanks

Parfait
  • 104,375
  • 17
  • 94
  • 125
r3vdev
  • 315
  • 3
  • 10

2 Answers2

2

For children elements that may or may not be present, consider iterating across the parent node, here being <row>, by node position. Then, build lists of dataframes casting each column to the needed value or zero length string using XPath's concat to always return a result for equal length columns. Finally rbind all dataframes of list for final.

row_length <- length(xpathSApply(xmlfilealt, "//row"))

df_List <- lapply(seq(row_length), function(i){

  data.frame(
    username = xpathSApply(xmlfilealt, sprintf("concat(//row[%s]/field[@name='USER_FULL_NAME'],'')", i), xmlValue),
    title = xpathSApply(xmlfilealt, sprintf("concat(//row[%s]/field[@name='VALUE'],'')", i), xmlValue),
    state = xpathSApply(xmlfilealt, sprintf("concat(//row[%s]/field[@name='ASSAY_WORKFLOW_STATE'],'')", i), xmlValue),
    actionDate = xpathSApply(xmlfilealt, sprintf("concat(//row[%s]/field[@name='STEP_DATE'],'')", i), xmlValue),
    actor = xpathSApply(xmlfilealt, sprintf("concat(//row[%s]/field[@name='DESCRIPTION'],'')", i), xmlValue),
    stringsAsFactors = FALSE
  )

})

# CONCATENATE ALL DFs 
finaldf <- do.call(rbind, df_List)

# CONVERT ZERO-LENGTH STRINGS TO NA
finaldf[] <- sapply(finaldf, function(col) ifelse(col=='', NA, col))

finaldf
#      username                 title        state               actionDate  actor
# 1 Ron Swanson                     2 InDelegation 2018-01-06T14:25:45-0500 Author
# 2        <NA>                 false         <NA>                     <NA>   <NA>
# 3        <NA>                 false         <NA>                     <NA>   <NA>
# 4        <NA> GH-VAP, IgG1 repeats,         <NA>                     <NA>   <NA>
# 5        <NA>       directDelegateU         <NA>                     <NA>   <NA>
# 6        <NA>                  <NA>         <NA>                     <NA>   <NA>
# 7        <NA>                  <NA>         <NA>                     <NA>   <NA>
# 8        <NA>                  <NA>         <NA>                     <NA>   <NA>
Parfait
  • 104,375
  • 17
  • 94
  • 125
  • Thanks Parfait! Your answer is easier to understand and tought me some new things, but I choose Dave2e answer because I started your code at 9PM, went to bed at 11PM, and it finished some time between then and 8AM :P! So it worked great but it took a while. What do you use to reformat the XML by the way? I tried using notepad++. – r3vdev Feb 10 '18 at 13:48
  • Very interesting this took hours and `xml2` took minutes! Had you gone with `xml2` I would have advised transforming your XML into simpler format using XSLT and then import into [dataframe with xml2's extension, xslt](https://stackoverflow.com/a/48247014/1422451) . – Parfait Feb 10 '18 at 16:52
1

This xml is very inconsistent and difficult to parse out in a consistent manner. I prefer to use the xml2 package, since I find the syntax easier to use.

library(xml2)

# parse all of the root nodes into separate nodes
rootnodes<-xml_find_all(page, "root")

# read the desired fields from each individual root nodes
a<-sapply(rootnodes, function(xnode) { xml_text(xml_find_first(xnode, "set/row/field[@name='ASSAY_WORKFLOW_STATE']"))})
b<-sapply(rootnodes, function(xnode) { xml_text(xml_find_first(xnode, "set/row/field[@name='STEP_DATE']"))})
c<-sapply(rootnodes, function(xnode) { xml_text(xml_find_first(xnode, "set/row/field[@name='USER_FULL_NAME']"))})
d<-sapply(rootnodes, function(xnode) { xml_text(xml_find_first(xnode, "set/row/field[@name='DESCRIPTION']"))})

#Create the desired output
df=data.frame(assaystate = a, stepdate=b, name = c, description = d)

The advantage of this approach is the expectation of each root node should contain each desired field. xml2's xml_find_first will return NA if that field/node is missing thus all the results will be the same size.

I left out the value field since there were at least fields with the value attribute and it was not clear if one or all of those values were required.

Dave2e
  • 22,192
  • 18
  • 42
  • 50
  • "This xml is very inconsistent and difficult to parse out in a consistent manner." Well that makes me feel better :P. Thanks Dave2e, this worked and only took a couple minutes to run. – r3vdev Feb 10 '18 at 13:44