I want to access the revision details in the XML export of a Wikipedia article. In other words, I want a data.frame with one row for each revision (which, as I understand the tree structure, should be //page/revision) and one column for each element of the revision sublist. Importantly, different revisions may contain different elements.

The data:

require(XML)
require(httr)
r <- POST("http://en.wikipedia.org/w/index.php?title=Special:Export", 
          body = "pages=Euroswydd&offset=1&limit=2&action=submit")
stop_for_status(r)
xml <- content(r, "text")
xml_data <- xmlToList(xml)
str(xml_data)

which outputs

List of 3
$ siteinfo:List of 6
..$ sitename  : chr "Wikipedia"
..$ dbname    : chr "enwiki"
..$ base      : chr "https://en.wikipedia.org/wiki/Main_Page"
..$ generator : chr "MediaWiki 1.27.0-wmf.17"
..$ case      : chr "first-letter"
..$ namespaces:List of 35
... [not of interest] ...
$ page    :List of 5
..$ title   : chr "Euroswydd"
..$ ns      : chr "0"
..$ id      : chr "86146"
..$ revision:List of 7
.. ..$ id         : chr "4028683"
.. ..$ timestamp  : chr "2002-09-16T03:24:52Z"
.. ..$ contributor:List of 2
.. .. ..$ username: chr "TUF-KAT"
.. .. ..$ id      : chr "8351"
.. ..$ model      : chr "wikitext"
.. ..$ format     : chr "text/x-wiki"
.. ..$ text       :List of 2
.. .. ..$ text  : chr "In [[Celtic mythology]], '''Eurossydd''' held [[Llyr]] hostage until his wife, [[Penarddun]] slept with him.  Their twin childr"| __truncated__
.. .. ..$ .attrs:Formal class 'XMLAttributes' [package "XML"] with 1 slot
.. .. .. .. ..@ .Data: chr [1:2] "preserve" "163"
.. ..$ sha1       : chr "ivzrvt6jgoga4ndtrdmz5ldg5elfoma"
..$ revision:List of 9
.. ..$ id         : chr "9228569"
.. ..$ parentid   : chr "4028683"
.. ..$ timestamp  : chr "2004-06-11T02:22:33Z"
.. ..$ contributor:List of 2
.. .. ..$ username: chr "Gtrmp"
.. .. ..$ id      : chr "38984"
.. ..$ minor      : NULL
.. ..$ model      : chr "wikitext"
.. ..$ format     : chr "text/x-wiki"
.. ..$ text       :List of 2
.. .. ..$ text  : chr "In [[Celtic mythology]], '''Eurossydd''' held [[Llyr]] hostage until his wife, [[Penarddun]] slept with him.  Their twin childr"| __truncated__
.. .. ..$ .attrs:Formal class 'XMLAttributes' [package "XML"] with 1 slot
.. .. .. .. ..@ .Data: chr [1:2] "preserve" "203"
.. ..$ sha1       : chr "kwd09htf87bjc51y2z9ykpnasu7nqle"
$ .attrs  :Formal class 'XMLAttributes' [package "XML"] with 1 slot
.. ..@ .Data: chr [1:3] "http://www.mediawiki.org/xml/export-0.10/ http://www.mediawiki.org/xml/export-0.10.xsd" "0.10" "en"

Now I can access the first revision list with xml_data[['page']][['revision']]. But how can I access the second revision?

CptNemo
  • For XML processing, XPath is the way to go. With `xml_data[['page']][['revision']]` you access the first revision list; with an iterator and `->next()` you will get the second element. See this code: http://stackoverflow.com/a/14448325/390462 – BendaThierry.com Mar 19 '16 at 09:38
  • Check out the `WikipediR` package; it's got `revision_content` and `revision_diff` functions. – alistaire Mar 19 '16 at 09:41

1 Answer


Using rvest, you can do something like the following:

Helper function:

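library(rvest)  # provides %>%
library(xml2)   # xml_children(), xml_length(), xml_name(), xml_text()

# Flatten one node into a named list; nested element names get a "parent_" prefix
parse_nested <- function(x, prefix = ''){
  kids = x %>% xml_children()
  # indices of children that have children of their own
  ind = which(sapply(kids, xml_length) != 0)
  if(!length(ind)){
    # only leaf children: return their text, named by tag
    return(setNames(kids %>% xml_text(), 
                    paste0(prefix,kids %>% xml_name())))
  }
  # recurse into the nested children, prefixing names with the parent tag
  nested = parse_nested(kids[ind], 
                        prefix = paste0(prefix, kids[ind] %>% xml_name(), "_"))
  unnested = setNames(kids[-ind] %>% xml_text(), 
                      paste0(prefix, kids[-ind] %>% xml_name()))
  as.list(c(unnested, nested))
}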

Actual Code:

require(httr)
r <- POST("http://en.wikipedia.org/w/index.php?title=Special:Export", 
          body = "pages=Euroswydd&offset=1&limit=2&action=submit")

require(rvest)
doc <- read_html(r)
doc %>% 
  html_nodes("revision") %>% 
  lapply(parse_nested) %>% # parse each revision separately
  data.table::rbindlist(fill=TRUE) # combine them, filling missing columns with NA

Result (a data.table):

        id            timestamp    model      format ---
1: 4028683 2002-09-16T03:24:52Z wikitext text/x-wiki ---
2: 9228569 2004-06-11T02:22:33Z wikitext text/x-wiki ---

Thanks to @Arun for pointing out that data.table::rbindlist accepts a list.

plyr::rbind.fill can be used as an alternative to data.table::rbindlist.
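As for the narrower question of reaching the second revision in the xmlToList output: both sublists are named "revision", and `[[` with a name only returns the first match. A minimal base-R sketch of selecting by name and then by position is below. The `page` list is a hand-built stand-in for xml_data[['page']] (only a few fields are copied from the str() output in the question), so treat it as an illustration rather than the real object.

```r
# Hand-built stand-in for xml_data[['page']] from the question (fields abbreviated)
page <- list(
  title = "Euroswydd",
  revision = list(id = "4028683", timestamp = "2002-09-16T03:24:52Z"),
  revision = list(id = "9228569", parentid = "4028683",
                  timestamp = "2004-06-11T02:22:33Z")
)

# `[[` by name stops at the first match, so filter on names() to keep all revisions
revisions <- page[names(page) == "revision"]

# positional access now reaches the second revision
second <- revisions[[2]]

# pull one field from every revision
ids <- unname(vapply(revisions, function(x) x$id, character(1)))
```

With the real data, replacing `page` with xml_data[['page']] should give the same kind of list, which can then be indexed or looped over revision by revision.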

Rentrop