I am trying to convert this xml_file (and many other similar ones) to a data.frame in R. Desired outcome: a data.frame (or tibble, data.table, etc) with:
- One row per Deputado (which is the main tag/level of xml_file; there are 4 of those).
- All variables within each Deputado should be columns.
- Nested categories with multiple values (such as comissao, cargoComissoes, etc.) can be ignored.
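To make the desired shape concrete, here is a minimal sketch on a made-up miniature document, not the real file. The tag names ideCadastro, nome and email, the values, and the helper deputado_to_vec() are hypothetical and only for illustration. It keeps the direct children of each Deputado that have no child elements of their own and drops everything nested, which is an assumption about what "can be ignored" should mean.

```r
library(xml2)

# Made-up stand-in for xml_file: two Deputado nodes, each with a nested,
# multi-valued block (comissoes) that should NOT become columns.
toy <- read_xml('
<Deputados>
  <Deputado>
    <ideCadastro>1</ideCadastro>
    <nome>A</nome>
    <email>a@example.org</email>
    <comissoes>
      <comissao><sigla>X</sigla></comissao>
      <comissao><sigla>Y</sigla></comissao>
    </comissoes>
  </Deputado>
  <Deputado>
    <ideCadastro>2</ideCadastro>
    <nome>B</nome>
    <email></email>
    <comissoes>
      <comissao><sigla>Z</sigla></comissao>
    </comissoes>
  </Deputado>
</Deputados>')

# Hypothetical helper: one Deputado node -> named character vector
deputado_to_vec <- function(node) {
  kids <- xml_children(node)
  leaf <- kids[xml_length(kids) == 0]      # direct children without child elements
  leaf <- leaf[!duplicated(xml_name(leaf))] # keep the first of any repeated tag
  vals <- xml_text(leaf, trim = TRUE)
  vals[vals == ""] <- NA_character_         # empty strings become NA
  names(vals) <- xml_name(leaf)
  vals
}

deputados <- xml_find_all(toy, ".//Deputado")
rows <- lapply(deputados, deputado_to_vec)

# Align on the union of column names so a missing tag becomes NA
cols <- unique(unlist(lapply(rows, names)))
mat <- do.call(rbind, lapply(rows, function(v) unname(v[cols])))
df <- as.data.frame(mat, stringsAsFactors = FALSE)
names(df) <- cols
df
```

The result has one row per Deputado and one character column per simple tag. Single-valued tags that sit one level deeper would be dropped by this sketch, so it only illustrates the target shape.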
In the code below, I tried to follow Example 2 in the readme of github/.../xmltools closely, but I got the error:
...
+ dplyr::mutate_all(empty_as_na)
Error: Argument 4 must be length 4, not 39
Any help fixing this (or a different strategy with a complete example) would be greatly appreciated.
The code (with reproducible error) is:
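library(magrittr) # provides %>%
library(xmltools) # provides xml_view_tree(), xml_get_paths(), xml_dig_df()
file <- "https://www.camara.leg.br/SitCamaraWS/Deputados.asmx/ObterDetalhesDeputado?ideCadastro=141428&numLegislatura="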
doc <- file %>%
  xml2::read_xml()
nodeset <- doc %>%
  xml2::xml_children()
length(nodeset) # lots of nodes!
nodeset[1] %>% # let's look at ONE node's tree
  xml_view_tree()
# let's assume that most nodes share the same structure
terminal_paths <- nodeset[1] %>%
  xml_get_paths(only_terminal_parent = TRUE)
terminal_xpaths <- terminal_paths %>% ## collapse xpaths to unique only
  unlist() %>%
  unique()
# xml_to_df (XML package based)
## note that we use file, not doc, hence is_xml = FALSE
# df1 <- lapply(xpaths, xml_to_df, file = file, is_xml = FALSE, dig = FALSE) %>%
# dplyr::bind_cols()
# df1
# xml_dig_df (xml2 package based)
## faster!
empty_as_na <- function(x){
  if("factor" %in% class(x)) x <- as.character(x) ## since ifelse won't work with factors
  if(class(x) == "character") ifelse(as.character(x)!="", x, NA) else x
}
terminal_nodesets <- lapply(terminal_xpaths, xml2::xml_find_all, x = doc) # use xml docs, not nodesets! I think this is because it searches the 'root'.
df2 <- terminal_nodesets %>%
  purrr::map(xml_dig_df) %>%
  purrr::map(dplyr::bind_rows) %>%
  dplyr::bind_cols() %>%
  dplyr::mutate_all(empty_as_na)
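My guess is that the error comes from the column-binding step, because the per-xpath pieces do not all have the same number of rows. That is an assumption and I have not confirmed it. A minimal way to check this kind of mismatch, sketched here on made-up data frames (pieces, x and y are hypothetical stand-ins, not the real output of xml_dig_df):

```r
# Hypothetical stand-ins for the per-xpath data frames
pieces <- list(
  a = data.frame(x = 1:4),
  b = data.frame(y = 1:39)
)

# Row count of each piece; column-binding needs these to agree
n_rows <- vapply(pieces, nrow, integer(1))
n_rows

# TRUE when the pieces cannot be bound side by side as they are
length(unique(n_rows)) > 1
```

Running the same check on the list produced just before dplyr::bind_cols() should show which xpaths return more rows than there are Deputado nodes.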