I am trying to convert this xml_file (and many other similar ones) to a data.frame in R. Desired outcome: a data.frame (or tibble, data.table, etc) with:

  • One row per Deputado (which is the main tag/level of xml_file, there are 4 of those)
  • All variables within each Deputado should be columns.
  • Nested categories with multiple values (such as comissao, cargoComissoes, etc.) can be ignored.

In the code below, I tried to follow Example 2 in the readme of github/.../xmltools closely, but I got the error:

...
+   dplyr::mutate_all(empty_as_na)
Error: Argument 4 must be length 4, not 39

Any help fixing this (or a different strategy with a complete example) would be greatly appreciated.

The code (with reproducible error) is:

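library(magrittr) # provides the %>% pipe
library(xmltools) # xml_view_tree(), xml_get_paths(), xml_dig_df()

file <- "https://www.camara.leg.br/SitCamaraWS/Deputados.asmx/ObterDetalhesDeputado?ideCadastro=141428&numLegislatura="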
doc <- file %>%
  xml2::read_xml()
nodeset <- doc %>%
  xml2::xml_children()
length(nodeset) # lots of nodes!
nodeset[1] %>% # lets look at ONE node's tree
  xml_view_tree()
# lets assume that most nodes share the same structure
terminal_paths <- nodeset[1] %>%
  xml_get_paths(only_terminal_parent = TRUE)

terminal_xpaths <- terminal_paths %>% ## collapse xpaths to unique only
  unlist() %>%
  unique()

# xml_to_df (XML package based)
## note that we use file, not doc, hence is_xml = FALSE
# df1 <- lapply(xpaths, xml_to_df, file = file, is_xml = FALSE, dig = FALSE) %>%
#   dplyr::bind_cols()
# df1

# xml_dig_df (xml2 package based)
## faster!
empty_as_na <- function(x){
  if("factor" %in% class(x)) x <- as.character(x) ## since ifelse wont work with factors
  if(class(x) == "character") ifelse(as.character(x)!="", x, NA) else x
}

terminal_nodesets <- lapply(terminal_xpaths, xml2::xml_find_all, x = doc) # use xml docs, not nodesets! I think this is because it searches the 'root'.
df2 <- terminal_nodesets %>%
  purrr::map(xml_dig_df) %>%
  purrr::map(dplyr::bind_rows) %>%
  dplyr::bind_cols() %>%
  dplyr::mutate_all(empty_as_na)
LucasMation
  • Does the example work for you? It gives me an error in the `bind_cols` step here `terminal_nodesets %>% purrr::map(xml_dig_df) %>% purrr::map(dplyr::bind_rows) %>% dplyr::bind_cols()` stating `Error: Argument 4 must be length 4, not 39`. The data frames in the list have unequal numbers of rows, so they cannot be `cbind`-ed. – Ronak Shah Jun 10 '19 at 06:42
  • @RonakShah, that is the same error that I get. The only change I made to the code above was to swap the XML file in the first line. If you want to run the original example, just change it to `file <- "http://aiweb.cs.washington.edu/research/projects/xmltk/xmldata/data/courses/wsu.xml"` – LucasMation Jun 10 '19 at 13:51
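
The row-count mismatch described in the comments can be reproduced without the XML file. The following is a minimal sketch, assuming two made-up data frames with 4 and 39 rows; it uses base R's `cbind()` as a stand-in for `dplyr::bind_cols()`, so the error message differs from the one in the question, but the cause is the same.

```r
# Two toy data frames with different numbers of rows
# (4 and 39 mirror the lengths reported in the error message).
per_deputado <- data.frame(id = 1:4)
per_nested   <- data.frame(value = 1:39)

# Column-binding needs equal row counts, so this fails.
result <- tryCatch(
  cbind(per_deputado, per_nested),
  error = function(e) e
)

# Inspect the captured condition instead of stopping the script.
print(inherits(result, "error"))
print(conditionMessage(result))
```

Column-binding only works when every piece has one row per Deputado, which is why nested nodes with many repeated children break the pipeline.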

1 Answer

Here is an approach using the XML package.

library(tidyverse)
library(XML)

df = xmlInternalTreeParse("./Data/ObterDetalhesDeputado.xml")
df_root = xmlRoot(df)
df_children = xmlChildren(df_root)

df_flattened = map_dfr(df_children,  ~.x %>% 
                         xmlToList() %>% 
                         unlist %>% 
                         stack %>% 
                         mutate(ind = as.character(ind),
                                ind = make.unique(ind)) %>% # for duplicate identifiers
                         spread(ind, values))

The following nodes are nested lists, so they appear as duplicate columns with the numeric suffixes added by make.unique. You can remove them as needed.

cargosComissoes 2
partidoAtual 3
gabinete 3
historicoLider 4
comissoes 11
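
One possible way to remove those columns is to drop every column whose name starts with one of the nested node names. This is a minimal sketch using base R only; the toy `df_flattened` below and its column names are made up for illustration and stand in for the real result, and `nested_nodes`, `pattern`, and `df_flat_only` are hypothetical names.

```r
# Toy stand-in for df_flattened; the column names are illustrative only.
df_flattened <- data.frame(
  ideCadastro = "141428",
  nomeCivil = "Example Name",
  comissoes.comissao.sigla = "A",
  comissoes.comissao.sigla.1 = "B",
  gabinete.numero = "1",
  stringsAsFactors = FALSE,
  check.names = FALSE
)

# Names of the nested nodes listed above.
nested_nodes <- c("cargosComissoes", "partidoAtual", "gabinete",
                  "historicoLider", "comissoes")

# Match columns named exactly after a nested node or prefixed by "<node>.".
pattern <- paste0("^(", paste(nested_nodes, collapse = "|"), ")(\\.|$)")

# Keep only the columns that do not belong to a nested node.
df_flat_only <- df_flattened[, !grepl(pattern, names(df_flattened)), drop = FALSE]
print(names(df_flat_only))
```

Matching on the prefix rather than on the numeric suffix avoids accidentally dropping a non-nested column whose name happens to end in a number.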
Theo
  • Thanks, this works. I first collect 1000 XML files in a list, then lapply your flattening scheme above over that list. It takes about 10 minutes to run, but it works. – LucasMation Jun 10 '19 at 15:45