I wrote the code below to parse a simple XML file.

xmlfile <- xmlTreeParse(inFile$datapath, encoding = "UTF-8")
xmltop <- xmlRoot(xmlfile)
singlexml <- xmlSApply(xmltop, function(x) xmlSApply(x, xmlValue))
singlexml_df <- as.data.frame(t(singlexml), row.names = NULL)
indx <- sapply(singlexml_df, is.list)
singlexml_df[indx] <- lapply(singlexml_df[indx], function(x) as.character(x))
singlexml_df

XML :

<?xml version="1.0" encoding="UTF-8"?>
<CATALOG>
    <PLANT>
        <COMMON>Bloodroot</COMMON>
        <BOTANICAL>Sanguinaria canadensis</BOTANICAL>
        <ZONE>4</ZONE>
        <LIGHT>Mostly Shady</LIGHT>
        <PRICE>$2.44</PRICE>
        <AVAILABILITY>031599</AVAILABILITY>
    </PLANT>
    <PLANT>
        <COMMON>Columbine</COMMON>
        <BOTANICAL>Aquilegia canadensis</BOTANICAL>
        <ZONE>3</ZONE>
        <LIGHT>Mostly Shady</LIGHT>
        <PRICE>$9.37</PRICE>
        <AVAILABILITY>030699</AVAILABILITY>
    </PLANT>
</CATALOG>

This parses successfully and is converted to a data frame.

My new requirement is to parse a nested XML file.

When I try to parse the new nested XML, everything gets combined into a single column and is not correctly transformed into a data frame.

Please provide your suggestions.

Thanks

Gopal228
  • Are you seeing an error message? Your second xml example is not valid xml. It contains two root nodes, ALERT and CATALOG, and xml allows only one root node. Additionally, the ALERT tag is not closed, which xml requires. – Matthew Jan 06 '16 at 08:05
  • Thanks @Matthew for spotting the error. Previous XML is a modified one. I have now updated the exact XML which I want to parse. Could you please help me out. – Gopal228 Jan 06 '16 at 10:39

1 Answer

I would approach this using XPath as it is going to give you more control over what you retrieve, while relying less on specific structure. It is a much more flexible approach, and easily adapted to other inputs. In your first case (with the plants), you can do

library(XML)
plantfile <- xmlParse("plants.xml")
plant.df <- as.data.frame(t(xpathSApply(plantfile,"//PLANT",function(x) xmlSApply(x,xmlValue))))

In your later case, suppose that we want to extract a dataframe consisting of details from alert highlights:

library(XML)
alertfile <- xmlParse("alerts.xml")
alert.df <- as.data.frame(t(xpathSApply(alertfile,"//AlertHighlights/Data/Detail",function(x) xmlSApply(x,xmlValue))))

In the first case, using the XPath expression "//PLANT" retrieves all the PLANT nodes from your file (no matter how deep they occur).
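As a minimal, self-contained sketch of the "//PLANT" case, the catalog from the question can be parsed from a string rather than from a file. The variable names (plant_xml, plant_nodes, plant_df) are only illustrative, and stringsAsFactors = FALSE is an addition here so that the columns come back as plain character vectors.

```r
library(XML)

# Inline copy of the plant catalog from the question, so no file is needed.
plant_xml <- '<?xml version="1.0" encoding="UTF-8"?>
<CATALOG>
    <PLANT>
        <COMMON>Bloodroot</COMMON>
        <BOTANICAL>Sanguinaria canadensis</BOTANICAL>
        <ZONE>4</ZONE>
        <LIGHT>Mostly Shady</LIGHT>
        <PRICE>$2.44</PRICE>
        <AVAILABILITY>031599</AVAILABILITY>
    </PLANT>
    <PLANT>
        <COMMON>Columbine</COMMON>
        <BOTANICAL>Aquilegia canadensis</BOTANICAL>
        <ZONE>3</ZONE>
        <LIGHT>Mostly Shady</LIGHT>
        <PRICE>$9.37</PRICE>
        <AVAILABILITY>030699</AVAILABILITY>
    </PLANT>
</CATALOG>'

# asText = TRUE tells xmlParse that the argument is XML content, not a file name.
doc <- xmlParse(plant_xml, asText = TRUE)

# "//PLANT" selects every PLANT node, regardless of how deeply it is nested.
plant_nodes <- getNodeSet(doc, "//PLANT")

# For each PLANT node, collect the text of its child elements.
# The result has one column per PLANT, so transpose to get one row per PLANT.
plant_matrix <- xpathSApply(doc, "//PLANT", function(x) xmlSApply(x, xmlValue))
plant_df <- as.data.frame(t(plant_matrix), stringsAsFactors = FALSE)

print(plant_df)
```

Note that the transpose step relies on every PLANT node having the same child elements; if some nodes have missing or extra children, xpathSApply returns a list instead of a matrix and the conversion needs extra handling.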

In the second case, we retrieve all Detail nodes which are children of Data nodes, which are children of AlertHighlights nodes (no matter how deep). If you are certain that these are the only Detail nodes, then we can simplify to //Detail.

If you were going to be doing this a lot, I would even wrap it into a function:

xpathToDataFrame <- function(xmlinput,expr) as.data.frame(t(xpathSApply(xmlinput,expr,function(x) xmlSApply(x,xmlValue))))

Then we can just do

plant.df <- xpathToDataFrame(xmlParse("plants.xml"),"//PLANT")
alert.df <- xpathToDataFrame(xmlParse("alerts.xml"),"//AlertHighlights/Data/Detail")
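
To illustrate the difference between the scoped path and the shorter //Detail, here is a rough sketch on a made-up nested document. Only the element names AlertHighlights, Data and Detail come from the XPath above; the Alerts root, the Other branch, the Id and Text fields, and all values are invented for illustration and will differ from the real alert file.

```r
library(XML)

# Hypothetical nested XML: the structure below is an assumption, not the real alert file.
alert_xml <- '<Alerts>
  <AlertHighlights>
    <Data>
      <Detail><Id>1</Id><Text>first</Text></Detail>
      <Detail><Id>2</Id><Text>second</Text></Detail>
    </Data>
  </AlertHighlights>
  <Other>
    <Detail><Id>99</Id><Text>unrelated</Text></Detail>
  </Other>
</Alerts>'

# Same helper as above, with stringsAsFactors = FALSE added to keep character columns.
xpathToDataFrame <- function(xmlinput, expr) {
  as.data.frame(
    t(xpathSApply(xmlinput, expr, function(x) xmlSApply(x, xmlValue))),
    stringsAsFactors = FALSE
  )
}

doc <- xmlParse(alert_xml, asText = TRUE)

# Scoped path: only Detail nodes under AlertHighlights/Data.
scoped <- xpathToDataFrame(doc, "//AlertHighlights/Data/Detail")

# Unscoped path: every Detail node in the document, including the one under Other.
everything <- xpathToDataFrame(doc, "//Detail")

print(scoped)
print(everything)
```

The shorter expression is only safe when no other branch of the document contains Detail nodes; otherwise the longer path keeps unrelated rows out of the data frame.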
Matthew
  • Thanks @Matthew. I will try the function given above. I am actually trying to build a small Shiny app that takes the expression/nodes as input and displays the information for all the child nodes under the given node. – Gopal228 Jan 07 '16 at 02:31