R: Webscraping irregular blocks of values

Question

I am trying to scrape a webpage whose data is arranged in irregular blocks that are easy to pick out by eye. Take Wikipedia as an example. If I scrape the paragraph text from the article linked in the code below, I end up with 33 entries. If I instead grab just the headers, I end up with only 7 (see code below). This is not surprising, since some sections of an article have several paragraphs while others have only one or none.

My question, though, is: how do I associate my headers with my texts? If every header had the same number of paragraphs, or some fixed multiple, this would be trivial.

library(rvest)
wiki <- html("https://en.wikipedia.org/wiki/Web_scraping")

wikitext <- wiki %>% 
  html_nodes('p+ ul li , p') %>%
  html_text(trim=TRUE)

wikiheading <- wiki %>% 
  html_nodes('.mw-headline') %>%
  html_text(trim=TRUE)

Can you be a bit more clear on the desired result here? Perhaps use a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) that is more simple and stable to make it easier to verify — MrFlick, Jul 21 '15 at 20:17

johnson-shuffle · Accepted Answer · 2015-07-22T06:44:38.800

This will give you a list called `content` whose elements are named after the headings and contain the corresponding text.

library(rvest) # Assumes the development version 0.2.0.9, which is not yet on CRAN
library(xml2)  # Provides xml_children(), xml_name() and xml_text()
wiki <- html("https://en.wikipedia.org/wiki/Web_scraping")

# This node set contains the headings and text
wikicontent <- wiki %>% 
  html_nodes("div[id='mw-content-text']") %>%
  xml_children()

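# Locate the positions of the h2 headings among the children
headings <- sapply(wikicontent, xml_name)
# The final entry marks where the last section stops (the last two children are dropped)
headings <- c(grep("h2", headings), length(headings) - 1)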

# Loop through the headings, keeping the nodes in between them as content
content <- list()
for (i in seq_len(length(headings) - 1)) {  # seq_len() skips the loop if no heading was found
  foo <- wikicontent[headings[i]:(headings[i + 1] - 1)]
  foo.title <- xml_text(foo[[1]])    # the first node is the heading itself
  foo.content <- xml_text(foo[-1])   # everything after it is that section's text
  content[[i]] <- foo.content
  names(content)[i] <- foo.title
}

The key was spotting the `mw-content-text` node, whose children include all of the headings and text you want.
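
The same grouping can also be done without a loop. The following is only a rough sketch in base R, not part of the original answer: the `tags` and `texts` vectors are invented stand-ins for `sapply(wikicontent, xml_name)` and `xml_text(wikicontent)`, and `is_heading`, `section_id`, `keep` and `pairs` are names made up for this illustration. It assumes every section starts at an `h2` child, as in the loop above.

```r
# Invented stand-ins for sapply(wikicontent, xml_name) and xml_text(wikicontent)
tags  <- c("p", "h2", "p", "p", "h2", "h2", "ul")
texts <- c("Lead paragraph", "History", "First para", "Second para",
           "Empty section", "Techniques", "List items")

is_heading <- tags == "h2"
section_id <- cumsum(is_heading)   # 0 marks anything before the first heading
titles     <- texts[is_heading]

# Body nodes that sit under some heading (the headings themselves are excluded)
keep <- section_id > 0 & !is_heading

# Named list with one element per heading; sections without text stay as character(0)
content <- split(texts[keep],
                 factor(section_id[keep], levels = seq_along(titles)))
names(content) <- titles

# Long format: one row per (heading, text) pair
pairs <- data.frame(heading = titles[section_id[keep]],
                    text    = texts[keep],
                    stringsAsFactors = FALSE)
```

The running count of headings gives each node the index of the section it belongs to, so the number of paragraphs per heading no longer matters. The `pairs` data frame may be the more convenient shape if each paragraph needs to sit next to its heading.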

edited Jul 22 '15 at 06:44

answered Jul 21 '15 at 21:40

johnson-shuffle


looks promising. `xml_name` not found though – Francis Smart Jul 21 '15 at 22:04

sorry. i thought package `xml2` loaded with `rvest`. it is in there. – johnson-shuffle Jul 21 '15 at 22:06
Afraid the `xml_children()` function is returning this error for me [[ Error in UseMethod("nodeset_apply") : no applicable method for 'nodeset_apply' applied to an object of class "XMLNodeSet"]] Can you verify that this code is working after running `rm(list=ls())`? Thanks, sorry for the hassle. – Francis Smart Jul 22 '15 at 06:17

i know what the problem is. i am using `rvest` 0.2.0.9 which is not on CRAN yet. you can get it via `install_github("hadley/rvest")` as long as you have `devtools` installed. that is why `xml2` wasn't automatically loading. pretty sure this will get things working on your end. let me know. – johnson-shuffle Jul 22 '15 at 06:28
That did it! Thanks so much. – Francis Smart Jul 22 '15 at 11:35
