Consider the following XML
example
library(xml2)
myxml <- read_xml('
<data>
<obs ID="a">
<name> John </name>
<hobby> tennis </hobby>
<hobby> golf </hobby>
<skill> python </skill>
</obs>
<obs ID="b">
<name> Robert </name>
<skill> R </skill>
</obs>
</data>
')
Here I would like to get an (R or Pandas) dataframe from this XML that contains the columns name
and hobby
.
However, as you see, there is an alignment problem because hobby
is missing in the second node and John has two hobbies.
in R, I know how to extract specific values one at a time, for instance using xml2
as follows:
myxml%>%
xml_find_all("//name") %>%
xml_text()
myxml%>%
xml_find_all("//hobby") %>%
xml_text()
but how can I align this data correctly in a dataframe? That is, how can I obtain a dataframe as follows (note how I join with a |
the two hobbies of John):
# A tibble: 2 × 3
name hobby skill
<chr> <chr> <chr>
1 John tennis|golf python
2 Robert <NA> R
In R, I would prefer a solution using xml2
and dplyr
. In Python, I want to end-up with a Pandas dataframe. Also, in my xml there are many more variables I want to parse. I would like a solution that has allows the user to parse additional variables without messing too much with the code.
Thanks!
EDIT: thanks to everyone for these great solutions. All of them were really nice, with plenty of details and it was hard to pick up the best one. Thanks again!