
I need to extract a large number of XML sitemap elements from multiple XML files using rvest. I have been able to extract html_nodes from webpages using XPaths, but XML files are new to me.

Also, I can't find a Stack Overflow question that covers parsing an XML file from its address, rather than parsing a large chunk of XML text.

Example of what I have used for html:

library(dplyr)
library(rvest)

webpage <- "https://www.example.co.uk/"

data <- webpage %>%
  read_html() %>%
  html_nodes("any given node goes here") %>%
  html_text()

How do I adapt this to take the "loc" element from an XML file (parsed from its address) that looks like this:

<urlset>
<url>
<loc>https://www.example.co.uk/</loc>
<lastmod>2020-05-01</lastmod>
<changefreq>always</changefreq>
<priority>0.8</priority>
</url>
<url>
<loc>https://www.example.co.uk/news</loc>
<changefreq>always</changefreq>
<priority>0.6</priority>
</url>
<url>
<loc>https://www.example.co.uk/news/uk</loc>
<changefreq>always</changefreq>
<priority>0.5</priority>
</url>
<url>
<loc>https://www.example.co.uk/news/weather</loc>
<changefreq>always</changefreq>
<priority>0.5</priority>
</url>
<url>
<loc>https://www.example.co.uk/news/world</loc>
<changefreq>always</changefreq>
<priority>0.5</priority>
</url>
</urlset>

Here is the script kindly provided by Dave, with my changes:

library(xml2)

#list of files to process
fnames<-c("xml1.xml")

dfs<-lapply(fnames, function(fname) {
  doc<-read_xml(fname)

  #find loc and lastmod
  loc<-trimws(xml_text(xml_find_all(doc, ".//loc")))
  lastmod<-trimws(xml_text(xml_find_all(doc, ".//last")))

  #find all of the nodes/records under the urlset node
  nodes<-xml_children(xml_find_all(doc, ".//urlset"))

  #find the sub nodes names and values
  nodenames<-xml_name(nodes)
  nodevalues<-trimws(xml_text(nodes))

  #make data frame of all the values
  df<-data.frame(file=fname, loc=loc, lastmod=lastmod, node.names=nodenames, 
                 values=nodevalues, stringsAsFactors = FALSE, nrow(0))

})

#Make one long df
longdf<-do.call(rbind, dfs)

#make into a wide format
library(tidyr)
finalanswer<-spread(longdf, key=node.names, value=values)
  • If it is XML then you just need the xml2 package (rvest is an extension of this package); a minimal sketch follows these comments. See this question as a start: https://stackoverflow.com/questions/54237549/xml-data-in-r-different-filestructure/54241010#54241010 – Dave2e May 01 '20 at 11:57
  • Thanks, but I get "error: arguments imply differing number of rows: 1, 0" – Chris Ioannou May 01 '20 at 14:39
  • I've edited the above to show the other file I would like to extract the element from. Maybe this is why I get the differing-rows issue. Could you help? – Chris Ioannou May 01 '20 at 14:42
  • The two files have different structures, so yes, that would cause errors. The first one has "sitemap" for the parent nodes; the second has "url". Are there other types of files, or just these two? If it is just the two, my approach would be to write two different functions to parse each type and then merge the results. If there are more than two or three, this becomes more difficult, since everything needs relative references and the nodes can't be named directly. – Dave2e May 01 '20 at 14:48
  • I've changed the file format needed, for simplicity, and also added the script I am using based on your answer. What am I not adapting correctly in your script? I keep getting the same error. Note that I have the XML file in the working directory, and it is correctly named. – Chris Ioannou May 01 '20 at 15:08
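
A minimal sketch of what the first comment suggests, using only xml2 and assuming the sitemap has been saved locally as xml1.xml (the file name used in the question):

library(xml2)

#read_xml() accepts a file path, a URL, or a raw XML string
doc <- read_xml("xml1.xml")

#pull the text of every loc element
locs <- xml_text(xml_find_all(doc, ".//loc"))
locs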

2 Answers


Since the number of children per url node differs, here is a working approach:

library(xml2)
library(dplyr)

#read the sitemap file (saved locally as xml1.xml, as in the question)
file <- read_xml("xml1.xml")

#find the parent url nodes
parents <- xml_find_all(file, ".//url")

#parse each parent node
dfs <- lapply(parents, function(node){
  #find all children
  nodes <- xml_children(node)

  #get node names and values
  nodenames <- xml_name(nodes)
  values <- xml_text(nodes)

  #make a one-row data frame of the results
  df <- as.data.frame(t(values), stringsAsFactors = FALSE)
  names(df) <- nodenames
  df
})

#make the final answer (bind_rows fills the missing columns, e.g. lastmod, with NA)
answer <- bind_rows(dfs)

Since you have multiple files, you could enclose the script in an outer loop to cycle through the file list (see the sketch below). Of course, that is a loop within a loop, so performance will suffer if there is a large number of files and a large number of parent nodes in each file.
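
A sketch of that outer loop, assuming the file names are collected in fnames as in the question. The xml_ns_strip() call is there because real sitemap files usually declare a namespace, which would otherwise stop the plain XPaths from matching:

library(xml2)
library(dplyr)

fnames <- c("xml1.xml")

all_files <- lapply(fnames, function(fname) {
  doc <- read_xml(fname)
  xml_ns_strip(doc) #drop the sitemap namespace so .//url matches

  parents <- xml_find_all(doc, ".//url")

  #one-row data frame per url node, exactly as above
  dfs <- lapply(parents, function(node) {
    nodes <- xml_children(node)
    df <- as.data.frame(t(xml_text(nodes)), stringsAsFactors = FALSE)
    names(df) <- xml_name(nodes)
    df
  })

  out <- bind_rows(dfs)
  out$file <- fname #keep track of which file each row came from
  out
})

answer <- bind_rows(all_files)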

Alternative: if the number of child nodes is small and their names are known, it is best to parse them directly and avoid the lapply loop above. xml_find_first returns a missing node for any parent that lacks it (such as the url entries without lastmod), so the four vectors stay aligned, with NA where a node is absent.

loc <- xml_find_first(parents, ".//loc") %>% xml_text()
lastmod <- xml_find_first(parents, ".//lastmod") %>% xml_text()
changefreq <- xml_find_first(parents, ".//changefreq") %>% xml_text()
priority <- xml_find_first(parents, ".//priority") %>% xml_text()

answer <- data.frame(loc, lastmod, changefreq, priority, stringsAsFactors = FALSE)

Here is some code I wrote a while ago to walk all of the XML files in a folder and collect specific nodes from a common XML pattern; with a little tweaking you may be able to use something from it.

library("xml2")
library("XML")

setwd("/xml")
dir <- dir()
tabela=matrix(NA,nrow=length(a),ncol=1)

  for(i in 1:length(dir)){

  visitNode <- function(node) {#Recursive Function to visit the XML tree (depth first)
    if (is.null(node)) {#leaf node reached. Turn back
      return()
    }
    print(paste("Node: ", xmlName(node)))
      num.children = xmlSize(node)

    if(num.children == 0 ) {# Add your code to process the leaf node here
      print(      paste("   ", xmlValue(node)))
    }
    if (num.children > 0){#Go one level deeper
      for (i in 1 : num.children) {
        visitNode(node[[i]][["NFe"]]) #the i-th child of node
      }
    }

  }
  xmlfile <- dir[i]
  xtree <- xmlInternalTreeParse(xmlfile)
  root <- xmlRoot(xtree)
  dataxml <- visitNode(root)
  dataxml <- xmlToList(root)


  df<- as.data.frame(matrix(unlist(dataxml$NFe$infNFe$infAdic$infCpl), nrow=length(dataxml$NFe$infNFe$infAdic$infCpl),byrow=TRUE))
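
The walker above is tied to my own NFe invoice files. As a sketch of how the same depth-first idea could be adapted to the sitemap in this question, the function below collects the text of every loc element instead of printing each leaf (collectLoc is an illustrative name, not part of the XML package):

library(XML)

#depth-first walk that gathers the text of every loc element
collectLoc <- function(node, found = character()) {
  if (is.null(node)) return(found)
  if (xmlName(node) == "loc") { #loc holds only text, so read it and stop descending
    return(c(found, xmlValue(node)))
  }
  for (i in seq_len(xmlSize(node))) {
    found <- collectLoc(node[[i]], found)
  }
  found
}

xtree <- xmlInternalTreeParse("xml1.xml")
locs <- collectLoc(xmlRoot(xtree))
locs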