
Background

I have an XML settings file that can look like this:

<level1>
 <level2>
   <level3>
    <level4name>bob</level4name>
   </level3>
 </level2>
</level1>

However, there can be multiple instances of level3:

<level1>
 <level2>
   <level3>
    <level4name>bob</level4name> 
   </level3>
   <level3>
    <level4name>jack</level4name> 
   </level3>
   <level3>
    <level4name>jill</level4name> 
   </level3>
 </level2>
</level1>

There can also be multiple types of level4 nodes within each level3:

   <level3>
    <level4name>bob</level4name> 
    <level4dir>/home/bob/ </level4dir> 
    <level4logical>TRUE</level4logical> 
   </level3>

In R, I load this file using

library(XML)  # provides xmlTreeParse() and xmlToList()
settings.xml <- xmlTreeParse(settings.file)
settings <- xmlToList(settings.xml)

I want to write a script that collects all of the values contained in one type of level4 node (for example, level4name) into a vector of the unique values at this level, but I am stumped trying to do this in a way that works for all of the above cases.

One of the problems is that class(settings[['level2']]) is a list for the first two cases and a matrix for the third case.

> xmlToList(xmlTreeParse('case1.xml'))
$level2.level3.level4name
[1] "bob"
> xmlToList(xmlTreeParse('case2.xml'))
                  level2
level3.level4name "bob" 
level3.level4name "jack"
level3.level4name "jill"
> xmlToList(xmlTreeParse('case3.xml'))
       level2
level3 List,3
level3 List,1
level3 List,1

Questions

I have two questions:

  1. How can I extract a vector of the unique values of one type of level4 node, such as `level4name`?

  2. Is there a better way to do this?

David LeBauer
  • I have filed an [issue on GitHub](https://github.com/omegahat/XML/issues/1). This issue links to an alternative implementation of `xmlToList` that does not exhibit this behavior (but might contain other problems). – krlmlr Feb 06 '14 at 22:04

1 Answer


Try using the internal node representation of XML and the xpath language, which is very powerful.

> xml = xmlTreeParse("case2.xml", useInternalNodes=TRUE)
> xpathApply(xml, "//level4name", xmlValue)
[[1]]
[1] "bob"

[[2]]
[1] "jack"

[[3]]
[1] "jill"
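
To go from that list to the vector of unique values the question asks for, one possibility is `xpathSApply`, which simplifies the result to a character vector when every node yields a single value. The following is only a minimal sketch: the inline XML string, the duplicated "bob" entry, and the variable names `doc_text`, `doc`, and `level4_names` are assumptions made for illustration, not part of the original settings file.

```r
library(XML)

# Hypothetical inline document modelled on the third case in the question:
# several level3 nodes, one of which carries extra level4 node types.
doc_text <- "<level1>
 <level2>
   <level3>
    <level4name>bob</level4name>
    <level4dir>/home/bob/</level4dir>
    <level4logical>TRUE</level4logical>
   </level3>
   <level3>
    <level4name>jack</level4name>
   </level3>
   <level3>
    <level4name>bob</level4name>
   </level3>
 </level2>
</level1>"

# xmlParse() builds the internal-node representation that XPath needs.
doc <- xmlParse(doc_text, asText = TRUE)

# Select only the level4name nodes, take their text, and drop duplicates.
level4_names <- unique(xpathSApply(doc, "//level3/level4name", xmlValue))
print(level4_names)
```

Because the XPath expression selects nodes by name, the same call should behave the same way whether the file has one level3 node, several, or several with mixed level4 node types, which sidesteps the list-versus-matrix difference seen with `xmlToList`.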
Martin Morgan
  • Thanks @Martin, this is pretty much what I was looking for, but I still have two questions. If a node's tags and text are not on the same line, R returns "\n bob \n"; is there an easy way around this? And is there a way to have the function return a vector other than using `unlist`? – David LeBauer Mar 25 '11 at 02:14
  • xpath has [string functions](http://www.w3.org/TR/xpath/#section-String-Functions) for transformation; `normalize-space` strips leading / trailing white space. [this](http://stackoverflow.com/questions/3359512/is-it-possible-to-apply-normalize-space-to-all-nodes-xpath-expression-finds) provides hints, but I think you're likely out of luck except for an iterative solution. See `?xpathSApply` for returning a vector. – Martin Morgan Mar 25 '11 at 06:17
  • @David Actually, `sapply(getNodeSet(xml, "//level4name"), xpathApply, "normalize-space()")` first gets the nodes then normalizes space on each. – Martin Morgan Mar 25 '11 at 06:26
  • +1 because this one-liner solved all of my parsing needs for the foreseeable future! – Andrew Dec 03 '12 at 22:35
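
The whitespace issue raised in the comments can also be handled on the R side after extraction, rather than inside XPath. The sketch below assumes base R's `trimws` is available (R 3.2.0 or later); the inline XML and the names `doc_text`, `raw_names`, and `clean_names` are hypothetical. Unlike `normalize-space()`, `trimws` removes only leading and trailing whitespace and does not collapse runs of spaces inside a value.

```r
library(XML)

# Hypothetical document in which one value sits on its own line,
# so its text is surrounded by newlines and indentation.
doc_text <- "<level1><level2>
  <level3>
    <level4name>
      bob
    </level4name>
  </level3>
  <level3><level4name>jack</level4name></level3>
</level2></level1>"

doc <- xmlParse(doc_text, asText = TRUE)

# Raw values keep the surrounding whitespace from the file.
raw_names <- xpathSApply(doc, "//level4name", xmlValue)

# Strip leading/trailing whitespace, then keep the distinct values.
clean_names <- unique(trimws(raw_names))
print(clean_names)
```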