I'm trying to convert XML to a tibble using a tidyverse pipeline. The nodes have both attrs and text.

Input:

txt <- c('<node attrA="1A" attrB="1B">text1</node>',
         '<node attrA="2A" attrB="2B">text2</node>')

Desired output (as a tibble)

~attrA, ~attrB, ~text,
1A, 1B, text1,
2A, 2B, text2

I can get the attributes with xml_attrs, using this handy answer: tidyverse - prefered way to turn a named vector into a data.frame/tibble

library(xml2)
library(tidyverse)

txt <- c('<node attrA="1A" attrB="1B">text1</node>',
         '<node attrA="2A" attrB="2B">text2</node>')

txt %>% 
  map(read_xml) %>% 
  map(xml_attrs) %>% 
  map_df(bind_rows) 

But that does not get text1 and text2. I can get just the xml_text:

library(xml2)
library(tidyverse)

txt <- c('<node attrA="1A" attrB="1B">text1</node>',
         '<node attrA="2A" attrB="2B">text2</node>')

txt %>% 
  map(read_xml) %>%
  map(xml_text) %>%
  unlist() %>% 
  tibble(text = .)

Any idea how I can combine those to get both the xml_text and the xml_attrs through a single pipeline?

I tried writing a function to take an xml node and run both xml_text and xml_attrs on it, then mapping that function, but I couldn't get that to work (something to do with the externalptrs used by xml2?)

I think I'm really asking a question about "reusing" the thing passed on in a pipeline, so I'm guessing the answer has to do with . as an alias.

Edit: hmmmm, perhaps as_list is a solution here (although I'd still like the control that calling xml_attrs and xml_text brings).

txt %>% 
  map(read_xml) %>% 
  map(xml_find_all, xpath = "//node") %>% 
  map(as_list)

produces

[[1]]
[[1]][[1]]
[[1]][[1]][[1]]
[1] "text1"

attr(,"attrA")
[1] "1A"
attr(,"attrB")
[1] "1B"


[[2]]
[[2]][[1]]
[[2]][[1]][[1]]
[1] "text2"

attr(,"attrA")
[1] "2A"
attr(,"attrB")
[1] "2B"

which I'm betting can be made into the tibble I want (although it's beyond me right now :)

jameshowison

1 Answer

Well, this seems to work. It uses the . placeholder to call two different functions (xml_text and xml_attrs) on the same nodes.

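library(xml2)
library(tidyverse)

# txt is the character vector of XML strings from the question
txt %>% 
  map(read_xml) %>% 
  map(xml_find_all, xpath = "//node") %>% 
  tibble(text = map_chr(., xml_text),
         # attributes as a list column; pluck(1) unpacks the one-element
         # list that xml_attrs() returns for each nodeset
         xml_attr_col = map(., xml_attrs) %>% map(pluck, 1)) %>% 
  # drop the first column: because . only appears inside nested calls,
  # %>% still passes the node list as tibble()'s first argument
  select(-1) %>% 
  # spread the named vectors in the list column into attrA / attrB columns
  unnest_wider(xml_attr_col)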
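
For comparison, here is a minimal sketch of the "write a function and map it" idea from the question. It assumes each string parses to a single root node, as in the sample input. node_to_row is a made-up helper name for illustration, not something xml2 provides. It calls both xml_attrs and xml_text on the same node and returns a one-row tibble, so no . placeholder is needed.

```r
library(xml2)
library(tidyverse)

txt <- c('<node attrA="1A" attrB="1B">text1</node>',
         '<node attrA="2A" attrB="2B">text2</node>')

# node_to_row() is a hypothetical helper (not part of xml2).
# It takes one parsed node and returns a one-row tibble holding
# the node's attributes followed by its text.
node_to_row <- function(node) {
  attrs <- as.list(xml_attrs(node))  # named character vector -> named list
  as_tibble(c(attrs, list(text = xml_text(node))))
}

result <- txt %>%
  map(read_xml) %>%     # one parsed document per input string
  map(node_to_row) %>%  # one single-row tibble per document
  bind_rows()           # stack the rows into a single tibble

result
```

This should give the attrA, attrB and text columns in that order. If some nodes lack an attribute, bind_rows() fills the missing cells with NA.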
jameshowison