
Consider the following XML example:

library(xml2)

myxml <- read_xml('
<data>
  <obs ID="a">
  <name> John </name>
  <hobby> tennis </hobby>
  <hobby> golf </hobby>
  <skill> python  </skill>
  </obs>
  <obs ID="b">
  <name> Robert </name>
  <skill> R </skill>
  </obs>
  </data>
')

Here I would like to get an (R or Pandas) dataframe from this XML that contains the columns name, hobby and skill.

However, as you see, there is an alignment problem because hobby is missing in the second node and John has two hobbies.

In R, I know how to extract specific values one at a time, for instance using xml2 as follows:

myxml %>% 
  xml_find_all("//name") %>% 
  xml_text()

myxml %>% 
  xml_find_all("//hobby") %>% 
  xml_text()

but how can I align this data correctly in a dataframe? That is, how can I obtain a dataframe as follows (note how I join John's two hobbies with a |; a tiny example of that join follows the table):

# A tibble: 2 × 3
    name       hobby  skill
   <chr>       <chr>  <chr>
1   John tennis|golf python
2 Robert        <NA>      R
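
The join itself is nothing special; for the hobby column I just mean something like:

paste(c("tennis", "golf"), collapse = "|")
#> [1] "tennis|golf"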

In R, I would prefer a solution using xml2 and dplyr. In Python, I want to end up with a Pandas dataframe. Also, in my XML there are many more variables I want to parse, so I would like a solution that allows the user to parse additional variables without messing too much with the code.

Thanks!

EDIT: thanks to everyone for these great solutions. All of them were really nice, with plenty of detail, and it was hard to pick the best one. Thanks again!

ℕʘʘḆḽḘ
  • Thx. Maybe something along the lines of `myxml %>% xml_find_all("/data/obs") %>% map(function(x) sapply(c("name","hobby"), function(y) xml_text(xml_find_first(x,y)))) %>% do.call(rbind, .)`? (Spelled out after these comments.) – lukeA May 28 '17 at 18:32
  • maybe can you put this as a solution and explain what you are doing with the `sapply`? thanks again~ – ℕʘʘḆḽḘ May 28 '17 at 18:35
  • yw. I will add it as an answer if no better option pops up. Somehow it feels like it's a solution, but not really a good one. Let's wait... – lukeA May 28 '17 at 19:20
  • I edited the question to be more broad. A solution in Python is also good – ℕʘʘḆḽḘ May 30 '17 at 12:05
  • [Here](https://stackoverflow.com/questions/33446888/r-convert-xml-data-to-data-frame) is an IMHO interesting post, which has both an R and a Python solution. (E.g. I wonder if this can all be mimicked using the `reticulate` package & RStudio. hmm) – lukeA May 30 '17 at 12:27
  • thx but not sure this solution works when there are missing nodes like this. – ℕʘʘḆḽḘ May 30 '17 at 12:47
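
For reference, lukeA's one-liner from the comments, spread over several lines (a sketch only; since it uses xml_find_first() it keeps just the first hobby per obs, and it returns a character matrix rather than a tibble):

library(xml2)
library(purrr)

myxml %>%
  xml_find_all("/data/obs") %>%
  map(function(x) sapply(c("name", "hobby"),
                         function(y) xml_text(xml_find_first(x, y)))) %>%
  do.call(rbind, .)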

4 Answers


A general R solution that does not require hardcoding the variables.
Using xml2 and tidyverse's purrr:

library(xml2)
library(purrr)

myxml %>% 
  xml_find_all('obs') %>%      
  # Enter each obs and return a df
  map_df(~{

    # Scan names
    node_names <- .x %>% 
      xml_children() %>% 
      xml_name() %>%
      unique()        

    # Remember ob
    ob <- .x

    # Enter each node
    map(node_names, ~{

      # Find similar nodes
      node <- xml_find_all(ob, .x) %>%
        xml_text(trim = TRUE) %>%
        paste0(collapse = '|') %>% 
        'names<-'(.x)
        # ^ we need to name the element to 
        #   overwrite it with its 'siblings'

    }) %>% 
      # Return an 'ob' vector
      flatten()        
  })

#> # A tibble: 2 × 3
#>     name       hobby  skill
#>    <chr>       <chr>  <chr>
#> 1   John tennis|golf python
#> 2 Robert        <NA>      R

What it does:

  1. It 'enters' each obs, then finds and stores the unique node names in that obs.
  2. For each node name, it finds all matching nodes in the obs, collapses them and stores the result as a named element of a list.
  3. It flattens the list, overwriting elements with the same name.
  4. map_df() implicitly rbinds each 'flattened' list into the resulting data.frame.
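
A minimal illustration of steps 2 and 3 on the first obs (a sketch, assuming the myxml object from the Data block below): each node name in the obs becomes a single, named, pipe-collapsed string.

ob <- xml_find_all(myxml, 'obs')[[1]]

xml_find_all(ob, 'hobby') %>%
  xml_text(trim = TRUE) %>%
  paste0(collapse = '|') %>%
  `names<-`('hobby')
#>         hobby
#> "tennis|golf"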

Data:

myxml <- read_xml('
                  <data>
                  <obs ID="a">
                  <name> John </name>
                  <hobby> tennis </hobby>
                  <hobby> golf </hobby>
                  <skill> python  </skill>
                  </obs>
                  <obs ID="b">
                  <name> Robert </name>
                  <skill> R </skill>
                  </obs>
                  </data>
                  ')
GGamba
  • hi ! thanks for the answer. Why do you think it will not perform well with multiple duplicates? it is indeed the case in my data.... – ℕʘʘḆḽḘ May 31 '17 at 14:34
  • I guess it's more an inefficiency than a poor-performance issue. For each node (e.g. `hobby`), it will find all its similar nodes (`xml_find_all('hobby')`), and then only keep the last one anyway. – GGamba May 31 '17 at 14:45
  • Actually, now that I think about it, a simple `unique()` on the `node_names` would remove the issue.. silly me.. – GGamba May 31 '17 at 14:47
  • hi! just a few remarks. I don't understand the syntax for `node_names <- .x` (why is there a dot?) and `'names<-'(.x)` (what does that mean?) – ℕʘʘḆḽḘ Jun 02 '17 at 00:51
  • also, why the tilde in `map(node_names, ~{`? man, your code is too refined for a noobie like me :D – ℕʘʘḆḽḘ Jun 02 '17 at 00:52
  • summoning @ggamba – ℕʘʘḆḽḘ Jun 02 '17 at 01:01
  • in a `purrr::map*` function, and many others in the same package, `~` is a shortcut for `function(x)`, and in the expression following it `.` or `.x` represents the `x` argument. `purrr` provides great alternatives to `*apply`, `Map`, `Reduce` and other functional-programming base R functions. You can read [more about purrr here](http://purrr.tidyverse.org/); a short sketch follows after these comments. – HubertL Jun 02 '17 at 19:09
  • @GGamba Thanks, it helped me a lot! I don't understand the `'names<-'(.x)` though, or why it has to be flattened at the end. – Pablo Olmos de Aguilera C. Mar 26 '20 at 15:46
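
A short sketch of the two points raised above, i.e. purrr's ~ shorthand and the 'names<-'() replacement form:

library(purrr)

# the formula shorthand: these two calls are equivalent
map(1:3, function(x) x + 1)
map(1:3, ~ .x + 1)

# `names<-`(x, value) is the replacement form of names(): it sets the names of
# x to value, just like setNames(x, value) or purrr::set_names(x, value)
`names<-`("tennis|golf", "hobby")
#>         hobby
#> "tennis|golf"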

In R, I'd probably use

library(XML)
# `myxml` here is assumed to be the raw XML string; if it is the xml2 document
# from the question, use xmlParse(as.character(myxml), asText = TRUE) instead
lst <- xmlToList(xmlParse(myxml)[['/data']])
(df <- data.frame(t(sapply(lst, function(x) {
  c(x['name'], hobby=paste0(x[which(names(x)=='hobby')], collapse="|"))
}))) )
#       name           hobby
# 1    John   tennis | golf 
# 2  Robert   

and maybe do some polishing using df[df==""] <- NA and trimws() to remove whitespace.
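
For instance (a sketch of that polishing step):

df[] <- lapply(df, trimws)  # strip leading/trailing whitespace in every column
df[df == ""] <- NA          # empty strings (the missing hobbies) become NA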


Or:

library(xml2)
library(dplyr)
`%|||%` <- function (x, y) if (length(x)==0) y else x 
(df <- data_frame(
  names = myxml %>% 
    xml_find_all("/data/obs/name") %>% 
    xml_text(trim=TRUE), 
  hobbies = myxml %>% 
    xml_find_all("/data/obs") %>% 
    lapply(function(x) xml_text(xml_find_all(x, "hobby"), T) %|||% NA_character_)
))
# # A tibble: 2 × 2
#    names   hobbies
#    <chr>    <list>
# 1   John <chr [2]>
# 2 Robert <chr [1]>
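
To get from the list column to the pipe-separated strings shown in the question, one extra step is enough (see also the comment below); a sketch that keeps the NA for Robert (a plain sapply(df$hobbies, paste, collapse = "|") would turn it into the string "NA"):

df$hobbies <- sapply(df$hobbies, function(h)
  if (all(is.na(h))) NA_character_ else paste(h, collapse = "|"))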
lukeA
  • thanks! can you just please explain a bit what you are doing with the `sapply` stuff? – ℕʘʘḆḽḘ May 30 '17 at 12:34
  • I mean, can you just decompose a bit what you are doing here? `(df <- data.frame(t(sapply(lst, function(x) { c(x['name'], hobby=paste0(x[which(names(x)=='hobby')], collapse="|"))`? thanks! – ℕʘʘḆḽḘ May 30 '17 at 13:52
  • @Noobie `sapply` loops through the list and grabs the values named _name_ and _hobby_. I used `which` because there can be multiple hobbies. Each iteration returns a vector of the name and the pipe-separated hobbies. – lukeA May 30 '17 at 21:33
  • @Noobie However, if your xml is very big, using a list is probably not the best way. Maybe something like `\`%|||%\` <- function (x, y) { if (length(x)==0) y else x }; data_frame(names = myxml %>% xml_find_all("/data/obs/name") %>% xml_text(trim=TRUE), hobbies = myxml %>% xml_find_all("/data/obs") %>% lapply(function(x) xml_text(xml_find_all(x, "hobby"), T) %|||% NA_character_))` to get back to xml2 and dplyr? I'd rather store the hobbies in a list than as pipe-separated strings... – lukeA May 30 '17 at 21:34
  • thanks! do you mind adding this solution to your previous solution? it is hard to read right now. btw what the hell is ``%|||%` <-` ? :D never seen that in my noob R life! – ℕʘʘḆḽḘ May 30 '17 at 22:37
  • @Noobie You are right, just did so; I stole that one shamelessly from `purrr::\`%||%\`` (<- works for NULL values, but I needed one for 0-lengths). But I guess it's more or less a show-off. ;-) – lukeA May 30 '17 at 22:42
  • the `xml2` solution is very elegant, but the dataframe now contains a list of characters? how can you get exactly the output desired? thanks! – ℕʘʘḆḽḘ Jun 02 '17 at 00:55
  • invoking @lukeA – ℕʘʘḆḽḘ Jun 02 '17 at 01:01
  • @Noobie: I deleted my previous comment, I dunno how I came up with it. Just do `df$hobbies <- sapply(df$hobbies,paste,collapse=",")`. – lukeA Jun 02 '17 at 11:43

XML

Create a function that can handle missing or multiple nodes, and then apply that to the obs nodes. I added the id column so you can see how to use xmlGetAttr too (use "." for the obs node and a leading "." on the other paths so they are relative to the current node in the set).

library(XML)
# assumption: parse the question's XML into `doc` with the XML package, e.g.
doc <- xmlParse(as.character(myxml), asText = TRUE)

xpath2 <- function(x, ...){
    y <- xpathSApply(x, ...)
    ifelse(length(y) == 0, NA,  paste(trimws(y), collapse=", "))
}  
obs <- getNodeSet(doc, "//obs")   
data.frame( id = sapply(obs, xpath2, ".", xmlGetAttr, "ID"),
          name = sapply(obs, xpath2, ".//name", xmlValue),
       hobbies = sapply(obs, xpath2, ".//hobby", xmlValue),
         skill = sapply(obs, xpath2, ".//skill", xmlValue))

  id   name      hobbies  skill
1  a   John tennis, golf python
2  b Robert         <NA>      R
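
Adding further variables is then just one more sapply() line per tag; for example, with a hypothetical <age> node (not present in the question's XML), xpath2 would simply return NA wherever it is missing:

data.frame( id = sapply(obs, xpath2, ".", xmlGetAttr, "ID"),
          name = sapply(obs, xpath2, ".//name", xmlValue),
           age = sapply(obs, xpath2, ".//age", xmlValue),
       hobbies = sapply(obs, xpath2, ".//hobby", xmlValue),
         skill = sapply(obs, xpath2, ".//skill", xmlValue))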

xml2

I don't use xml2 very often, but maybe get the obs nodes and then apply xml_find_all if there are duplicate tags instead of using xml_find_first.

obs <-  xml_find_all(myxml, "//obs")  
lapply(obs, xml_find_all, ".//hobby")

data_frame(
     name = xml_find_first(obs, ".//name") %>% xml_text(trim=TRUE),
  hobbies = sapply(obs, function(x)  paste(xml_text( xml_find_all(x, ".//hobby"), trim=TRUE), collapse=", " ) ),
    skill = xml_find_first(obs, ".//skill") %>% xml_text(trim=TRUE)
)

# A tibble: 2 x 3
    name      hobbies  skill
   <chr>        <chr>  <chr>
1   John tennis, golf python
2 Robert                   R

I tested both methods using the medline17n0853.xml file at the NCBI ftp. This is a 280 MB file with 30,000 PubmedArticle nodes, and the XML package took 102 seconds to parse pubmed ids, journals and combine multiple publication types. The xml2 code ran for 30 minutes and then I killed it, so that may not be the best solution.

Chris S.

pandas

import pandas as pd
from collections import defaultdict
import xml.etree.ElementTree as ET


xml_txt = """<data>
  <obs ID="a">
  <name> John </name>
  <hobby> tennis </hobby>
  <hobby> golf </hobby>
  <skill> python  </skill>
  </obs>
  <obs ID="b">
  <name> Robert </name>
  <skill> R </skill>
  </obs>
  </data>"""

etree = ET.fromstring(xml_txt)

def obs2series(o):
    d = defaultdict(list)
    [d[c.tag].append(c.text.strip()) for c in o]  # iterate children directly; getchildren() was removed in Python 3.9
    return pd.Series(d).str.join('|')

pd.DataFrame([obs2series(o) for o in etree.findall('obs')])

         hobby    name   skill
0  tennis|golf    John  python
1          NaN  Robert       R

How It Works

  • Build an element tree from the string. Otherwise do something like et = ET.parse('my_data.xml')
  • etree.findall('obs') returns a list of elements within the xml structure that are 'obs' tags
  • I pass each of these to obs2series, which builds a pd.Series
  • Within obs2series I loop through all child nodes in one 'obs' element.
  • defaultdict defaults to a list meaning I can append to a value even if the key hasn't been seen before.
  • I end up with a dictionary of lists. I pass this to pd.Series to get a series of lists.
  • Using pd.Series.str.join('|') I convert this to a series of strings as I wanted.
  • My list comprehension that loops over the observations therefore produces a list of series, ready to be passed to the pd.DataFrame constructor.
piRSquared
  • thanks man! nice one. do you mind explaining a bit what you are doing in the function? I am not super familiar with `xml.etree` – ℕʘʘḆḽḘ May 31 '17 at 11:58
  • @Noobie hopefully that helps – piRSquared May 31 '17 at 15:10
  • thanks! do you think this is robust to many duplicates or missing nodes? is the performance good? – ℕʘʘḆḽḘ May 31 '17 at 19:52
  • @Noobie it's robust for many duplicates as the lists just get longer and the join just joins them. I have to loop over all child nodes in each observation. I've wrapped both in comprehensions. Performance should be decent, no money-back guarantees or anything. BTW, I've updated the function. – piRSquared May 31 '17 at 19:58