3

I need to parse 2000 XML Files. I managed setting that I can automatically get my data from the files. Since I am a complete beginner, it maybe looks messy, here an example:

filenames <- list.files("C:/...", recursive=TRUE, full.names=TRUE, pattern=".xml")

name <- unlist(lapply(filenames, function(f) {  
  xml <- xmlParse(f)  
  xpathSApply(xml, "//...", xmlValue)
}))
data <- data.frame(name)

This works for most of my needed data but my current problem is that some files miss a certain data so I can't include them because of different number of rows. An example of what the files look like is: File 1:

<Kontaktdaten>
   <Name> Name </Name>
   <ID>12345678</ID>
   <Kontakt_Zugang>
       <Strasse>ABC-Strasse</Strasse>
       <Hausnummer>1</Hausnummer>
       <Postleitzahl>12345</Postleitzahl>
       <Ort>ABC</Ort>
   </Kontakt_Zugang> 
</Kontaktdaten>

File 2 (where "Hausnummer" is missing for example):

<Kontaktdaten>
   <Name> Name2 </Name>
   <ID>8765321</ID>
   <Kontakt_Zugang>
       <Strasse>CBA-Strasse</Strasse>
       <Postleitzahl>54321</Postleitzahl>
       <Ort>CBA</Ort>
   </Kontakt_Zugang> 
</Kontaktdaten>

Is there any way how I can combine them anyway in one data.frame or create a second data.frame only with the "Hausnummer" and the ID?

EDIT: This is only an example to show my problem. The original files are up to 500 nodes long, some of them are doubled.

Tamy
  • 31
  • 4
  • This question is similar to yours. https://stackoverflow.com/questions/53949273/parsing-large-xml-to-dataframe-in-r/53949609#53949609 – Dave2e Jan 17 '19 at 14:01
  • It looks like this person wants to work with one big file, however I have many different files – Tamy Jan 17 '19 at 14:22

3 Answers3

1

Here is a solution of parsing each xml file, creating a list of the sub nodes in the individual files, then combining all the lists, and then converting to the desired format.

See the code comments for the step by step instructions.

library(xml2)

#list of files to process
fnames<-c("xml1.xml", "xml2.xml")

dfs<-lapply(fnames, function(fname) {
  doc<-read_xml(fname)


  #find Name and ID
  Name<-trimws(xml_text(xml_find_all(doc, ".//Name")))
  ID<-trimws(xml_text(xml_find_all(doc, ".//ID")))

  #find all of the nodes/records under the Kontakt_Zugang node
  nodes<-xml_children(xml_find_all(doc, ".//Kontakt_Zugang"))

  #find the sub nodes names and values
  nodenames<-xml_name(nodes)
  nodevalues<-trimws(xml_text(nodes))

  #make data frame of all the values
  df<-data.frame(file=fname, Name=Name, ID=ID, node.names=nodenames, 
             values=nodevalues, stringsAsFactors = FALSE)

})

#Make one long df
longdf<-do.call(rbind, dfs)

#make into a wide format
library(tidyr)
finalanswer<-spread(longdf, key=node.names, value=values)

Here is the final result:

#     file  Name       ID Hausnummer Ort Postleitzahl     Strasse
# xml1.xml  Name 12345678          1 ABC        12345 ABC-Strasse
# xml2.xml Name2  8765321       <NA> CBA        54321 CBA-Strasse
Dave2e
  • 22,192
  • 18
  • 42
  • 50
  • When trying this, I get the error that the number of rows differ, but I cannot find the mistake I mad adapting this – Tamy Jan 17 '19 at 17:58
  • It is best to just to copy and paste from here, into your editor. I am not sure what could cause that error. Is the lapply function finishing? If not, try to add a `print(fname)` under the first line in the lapply function, to identify which file is causing the issue. – Dave2e Jan 17 '19 at 18:13
  • 1
    I only changed the .// since the files I am working with have up to 500 nodes and sometimes are doubled so I inserted the exact "path" – Tamy Jan 17 '19 at 18:50
  • Sounds like your actual files are more complicated then the samples you posted. If your files have multiple IDs and names then the code above most likely won't work. I made a slight tweak to df<-data.frame(file=fname... it may or may not help. – Dave2e Jan 17 '19 at 20:45
  • Sorry for my late reply, I got to test your way and it works for some data. Here is what my current problem is: https://imgur.com/a/D3a0v4E – Tamy Jan 24 '19 at 10:17
1

Consider the special purpose language, XSLT, designed to transform XML files for end use solutions such as flattening the nested node Kontakt_Zugang for import into R and migrated into data frame.

XSLT (save as an .xsl file to be parsed like any .xml file into R)

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>
<xsl:strip-space elements="*"/>

<!-- identity transform -->
<xsl:template match="@*|node()">
    <xsl:copy>
        <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
</xsl:template>

<xsl:template match="Kontakt_Zugang">
    <xsl:apply-templates select="@*|node()"/>
</xsl:template>

</xsl:stylesheet>

Online Demo

R

library(xml2)
library(xslt)

# RETRIEVE XML FILE NAMES
filenames <- list.files("C:/...", recursive=TRUE, full.names=TRUE, pattern=".xml")
all_cols <- c("Name", "ID", "Strasse", "Hausnummer", "Postleitzahl", "Ort")

# PARSE XSLT
style <- read_xml("/path/to/xslt_script.xsl", package = "xslt")

df_list <- lapply(filenames, function(f) {  
  # PARSE XML
  xml <- xml2::read_xml(f)    
  # TRANSFORM INPUT INTO OUTPUT
  new_xml <- xslt::xml_xslt(xml, style)

  # BUILD DATA FRAME
  vals <- xml_children(xml_find_all(new_xml, "//Kontaktdaten"))
  df <- setNames(data.frame(t(trimws(xml_text(vals)))), xml_name(vals))

  # FILL IN MISSING COLUMNS
  df[all_cols[!(all_cols %in% colnames(df))]] <- NA

  return(df[all_cols])
})

final_df <- do.call(rbind, df_list)
final_df
#    Name       ID     Strasse Hausnummer Postleitzahl Ort
# 1  Name 12345678 ABC-Strasse          1        12345 ABC
# 2 Name2  8765321 CBA-Strasse       <NA>        54321 CBA

By the way, because XSLT is a special-purpose language, it is not restricted to R but any language such as Java, PHP, Python that supports it and even external processors that R can make a command line call to run. As example, below uses Unix's (i.e., Mac and Linux) xsltproc:

# COMMAND LINE CALL TO UNIX'S XSLTPROC (ALTERNATIVE TO xslt PACKAGE)
system("xsltproc -o /path/to/input.xml /path/to/xslt_script.xsl /path/to/output.xml")
doc <- xmlParse("/path/to/output.xml")
Parfait
  • 104,375
  • 17
  • 94
  • 125
0

I created a single XML file that I believe resembles the one you maybe working with. I parse this file and pull out data using the function get_xml_dat.

file <- list.files('~/R/', pattern = '.xml', full.names = TRUE)


get_xml_dat <- function(xml_file, main_node, values) {
  require(tidyverse)
  require(rvest)
  require(xml2)


  nodeset <- xml_file %>% read_html() %>% html_nodes(tolower(main_node)) # pull out node with data from XML

  node_values <- paste0("<", tolower(values), ">.*</", tolower(values), ">") # create pattern to extract values

  dat <-  str_extract_all(as.character(nodeset), paste0(node_values, collapse = "|")) # extract values from nodeset

  lapply(dat, function(foo) {
    foo %>% gsub("</.*>|<", "", .) %>% # clean up data values
      tibble(V1 = .) %>% # transform string of values to long data_frames
      separate(., # separate node name from node value
               col = V1,
               into = c("cols", "vals"),
               sep = ">"
      ) %>%
      spread(cols, vals) # long data_frame to wide data_frame
  }) %>%
    bind_rows() %>% # bind list of data_frames
    select(tolower(values)) # orders columns
}

Output

> get_xml_dat(
+   xml_file = file, # XML file you want to get data from
+   main_node = 'Kontaktdaten', # node where the data is located
+   values = c('Name', 'ID', 'Hausnummer', 'Postleitzahl', 'Ort', 'Strasse') # values you want to get from the main_node
+ )
# A tibble: 3 x 6
  name         id     hausnummer postleitzahl ort   strasse    
  <chr>        <chr>  <chr>      <chr>        <chr> <chr>      
1 " Name_ABC " 912283 1          12345        ABC   ABC-Strasse
2 " Name_DEF " 123456 NA         12345        DEF   DEF-Strasse
3 " Name_XYZ " 123456 3          12345        XYZ   XYZ-Strasse

Data

XML file that is loaded into R.

<Qualitaetsbericht>
    <Krankenhaus>
        <Kontaktdaten>
            <Name> Name_ABC </Name>
            <ID>912283</ID>
            <Kontakt_Zugang>
                <Strasse>ABC-Strasse</Strasse>
                <Hausnummer>1</Hausnummer>
                <Postleitzahl>12345</Postleitzahl>
                <Ort>ABC</Ort>
            </Kontakt_Zugang>
        </Kontaktdaten>
    </Krankenhaus>
    <Klinik> 
        <Kontaktdaten>
            <Name> Name_DEF </Name>
            <ID>123456</ID>
            <Kontakt_Zugang>
                <Strasse>DEF-Strasse</Strasse>
                <Postleitzahl>12345</Postleitzahl>
                <Ort>DEF</Ort>
            </Kontakt_Zugang>
        </Kontaktdaten>
    </Klinik>
    <Universitaet>
        <Kontaktdaten>
            <Name> Name_XYZ </Name>
            <ID>123456</ID>
            <Kontakt_Zugang>
                <Strasse>XYZ-Strasse</Strasse>
                <Hausnummer>3</Hausnummer>
                <Postleitzahl>12345</Postleitzahl>
                <Ort>XYZ</Ort>
            </Kontakt_Zugang>
        </Kontaktdaten>
    </Universitaet>
    <Other_DATA>
        <Some_Var0>
            <X>100</X>
            <Y>100</Y>
            <Z>100</Z>
        </Some_Var0>
    </Other_DATA>
</Qualitaetsbericht>
parkerchad81
  • 548
  • 3
  • 9
  • 1
    No reason, I actually fixed my answer using a more direct approach. – parkerchad81 Jan 17 '19 at 17:23
  • Would you kindly explain to me what to fill in? I am trying to understand it – Tamy Jan 17 '19 at 17:59
  • @Tamy I added some comments to the code. Tell me if you have other questions. – parkerchad81 Jan 17 '19 at 18:16
  • The files I am working with have 50-500 nodes which are sometimes doubles, so I guess I cannot use this? Sorry it was only an example of a file because I though I could not explain my problem without it and I can't unfortunately not post my data – Tamy Jan 17 '19 at 18:45
  • You should try the code above now. Without looking at your actual data it is as close to what you want, that I can think. – parkerchad81 Jan 17 '19 at 21:05
  • Fehler in setnames(x, value) : Can't assign 0 names to a 1 column data.table – Tamy Jan 17 '19 at 21:59
  • I would need a better depiction of your data to help. – parkerchad81 Jan 17 '19 at 22:33
  • They are XML files with up to 500 nodes and the node names like "Name" are sometimes doubled (like I have ABC but also CBA and not every file has the same nodes. I also only need some specific information out of the xml file, that's why I worked with //... so far. Right now I am looking for a solution with ifelse and I think I might have worked it out but it is a very messy way to do it. https://imgur.com/Gb0wiae – Tamy Jan 19 '19 at 13:22
  • I edited the code above to create a function that allows you more control over what you pull out of the XML. As well as, the XML file I was using to try and get closer to what you are doing. – parkerchad81 Jan 20 '19 at 00:20
  • I tried it now again since I only have access to the data a few times a week and I get this error: Error in doc_parse_file(con, encoding = encoding, as_html = as_html, options = options) : Expecting a single string value: [type=character; extent=10]. – Tamy Jan 23 '19 at 13:19