I'm trying to convert XML to a tibble using a tidyverse pipeline. Each node has both attributes and text.
Input:
txt <- c('<node attrA="1A" attrB="1B">text1</node>',
'<node attrA="2A" attrB="2B">text2</node>')
Desired output (as a tibble):
~attrA, ~attrB, ~text,
1A, 1B, text1,
2A, 2B, text2
I can get the attributes with xml_attrs, using this handy answer: tidyverse - prefered way to turn a named vector into a data.frame/tibble
library(xml2)
library(tidyverse)
txt <- c('<node attrA="1A" attrB="1B">text1</node>',
'<node attrA="2A" attrB="2B">text2</node>')
txt %>%
map(read_xml) %>%
map(xml_attrs) %>%
map_df(bind_rows)
But that does not get text1 and text2. I can get just the xml_text:
library(xml2)
library(tidyverse)
txt <- c('<node attrA="1A" attrB="1B">text1</node>',
'<node attrA="2A" attrB="2B">text2</node>')
txt %>%
map(read_xml) %>%
map(xml_text) %>%
unlist() %>%
tibble(text = .)
Any idea how I can combine those to get both the xml_text and the xml_attrs through a single pipeline?
I tried writing a function that takes an xml node and runs both xml_text and xml_attrs on it, then mapping that function, but I couldn't get that to work (something to do with the externalptrs used by xml2?).
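A minimal sketch of that helper-function idea, assuming each string parses to a single node whose attributes should become columns. The name node_to_row is hypothetical (it is not part of xml2), and map_dfr is used here simply as one way to row-bind the per-node results:

```r
library(xml2)
library(tidyverse)

txt <- c('<node attrA="1A" attrB="1B">text1</node>',
         '<node attrA="2A" attrB="2B">text2</node>')

# Hypothetical helper: builds a one-row tibble from a single parsed node.
node_to_row <- function(node) {
  # xml_attrs() returns a named character vector; as.list() makes each
  # attribute a separate element so it becomes its own column.
  attrs <- as.list(xml_attrs(node))
  as_tibble(c(attrs, list(text = xml_text(node))))
}

result <- txt %>%
  map(read_xml) %>%
  map_dfr(node_to_row)
```

Because the helper receives the parsed node itself and calls both accessors on it, nothing has to be re-read or passed around outside the pipeline.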
I think I'm really asking a question about "reusing" the thing passed on in a pipeline, so I'm guessing the answer has to do with . as an alias.
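As a rough illustration of reusing the piped value, the sketch below wraps the last step in braces so that magrittr does not insert the left-hand side as the first argument, which leaves . free to be used more than once. It assumes one node per string, as in the input above, and is only one possible arrangement:

```r
library(xml2)
library(tidyverse)

txt <- c('<node attrA="1A" attrB="1B">text1</node>',
         '<node attrA="2A" attrB="2B">text2</node>')

result <- txt %>%
  map(read_xml) %>%
  {
    # Inside the braces, `.` is the list of parsed documents.
    bind_cols(
      # one row of attributes per document
      map_dfr(., ~ as_tibble(as.list(xml_attrs(.x)))),
      # one text value per document
      tibble(text = map_chr(., xml_text))
    )
  }
```

The two pieces are built separately from the same list and then joined column-wise, so this relies on both keeping the documents in the same order.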
Edit: hmmmm, perhaps as_list is a solution here (although I'd still like the control that calling xml_attrs and xml_text brings).
txt %>%
map(read_xml) %>%
map(xml_find_all, xpath = "//node") %>%
map(as_list)
produces
[[1]]
[[1]][[1]]
[[1]][[1]][[1]]
[1] "text1"
attr(,"attrA")
[1] "1A"
attr(,"attrB")
[1] "1B"
[[2]]
[[2]][[1]]
[[2]][[1]][[1]]
[1] "text2"
attr(,"attrA")
[1] "2A"
attr(,"attrB")
[1] "2B"
which I'm betting can be made into the tibble I want (although it's beyond me right now :)