Loop to scrape data from Wikipedia in R

Question

I am trying to extract data about celebrity/notable deaths for analysis. Wikipedia has a very regular structure to their html paths concerning notable dates of death. It looks like:

https://en.wikipedia.org/wiki/Deaths_in_"MONTH"_"YEAR"

For example, this link leads to the notable deaths in March, 2014.

https://en.wikipedia.org/wiki/Deaths_in_March_2014

I have located the CSS selector for the lists I need, "#mw-content-text h3+ ul li", and extracted it for a specific link successfully. Now I'm trying to write a loop to go through the months and any years that I choose. I think it's a pretty straightforward nested loop, but I'm getting errors when testing it just on 2015.

library(rvest)
data = data.frame()
mlist = c("January","February","March","April","May","June","July","August",
          "September","October","November","December")

for (y in 2015:2015){
  for (m in 1:12){
    site = read_html(paste("https://en.wikipedia.org/wiki/Deaths_in_",mlist[m],
           "_",y,collapse=""))
    fnames = html_nodes(site,"#mw-content-text h3+ ul li")
    text = html_text(fnames)
    data = rbind(data,text,stringsAsFactors=FALSE)
  }
}

When I comment out the line:

data = rbind(data,text,stringsAsFactors=FALSE)

no errors are returned so it's clearly related to this bit. I am posting my whole code for other comments as well. The goal here is to loop through many years and then focus on the distribution over the years and months. For this I just need to keep the age, month, and year of death.

Thank you!

EDIT: Sorry, they are technically warnings, not errors. I get over 50 of them and when I try to look at "data" it is a giant mess.

When I run this code not as a loop on one specific URL, it works fine and returns a readable output.

site = read_html("https://en.wikipedia.org/wiki/Deaths_in_January_2015")
fnames = html_nodes(site,"#mw-content-text h3+ ul li")
text = html_text(fnames)

Here are a couple of rows from that data set:

text[1:5]
[1] "Barbara Atkinson, 88, British actress (Z-Cars).[1]"                                         
[2] "Staryl C. Austin, 94, American air force brigadier general.[2]"                             
[3] "Ulrich Beck, 70, German sociologist, heart attack.[3]"                                      
[4] "Fiona Cumming, 77, British television director (Doctor Who).[4]"                            
[5] "Eric Cunningham, 65, Canadian politician, Ontario MPP for Wentworth North (1975–1984).[5]"

I didn't get any error messages. What errors did you get? What's the output dataset supposed to look like? — Hack-R, Jun 22 '16 at 19:41
Please do post names of the packages that you are using so that one can easily reproduce your error. — abhiieor, Jun 22 '16 at 19:48
I edited the post to include an example of a successful scrape using this template and the library rvest that I am using. — user137698, Jun 22 '16 at 19:50

Dave2e · Accepted Answer · 2016-06-22T20:49:53.410

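html_text(fnames) returns a character vector, not a data frame. Your problem is that you are trying to append that vector onto a data frame: rbind() treats a bare vector as a single row, so each month's entries end up spread across columns instead of stacked as rows.
Try converting your variable text to a data frame before appending: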

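for (y in 2015:2015){
  for (m in 1:12){
    # paste0() joins the pieces without the spaces that paste() inserts by default
    site = read_html(paste0("https://en.wikipedia.org/wiki/Deaths_in_",mlist[m],
           "_",y))
    fnames = html_nodes(site,"#mw-content-text h3+ ul li")
    text = html_text(fnames)

    # wrap the character vector in a one-column data frame so rbind() stacks rows
    temp = data.frame(text, stringsAsFactors = FALSE)

    data = rbind(data,temp)
  }
}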

This is not the best technique for performance reasons. Each time through the loop, the memory for the data frame is reallocated, which slows things down. Since this is a one-time job with a limited number of requests, it should be manageable in this case.
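One way to avoid growing the data frame inside the loop is to store each month's result in a list and bind everything once at the end. The following is only a sketch under some assumptions: build_url, scrape_month and scrape_years are hypothetical helper names that do not appear in the question, the year and month columns are added because the question wants to keep them, and the CSS selector is the one from the question, which may not match Wikipedia's current markup.

```r
# Requires the rvest package for the actual scraping step.

# Hypothetical helper: build the page URL; paste0() adds no separators.
build_url <- function(month, year) {
  paste0("https://en.wikipedia.org/wiki/Deaths_in_", month, "_", year)
}

# Hypothetical helper: scrape one month into a data frame tagged with year and month.
scrape_month <- function(month, year) {
  site <- rvest::read_html(build_url(month, year))
  text <- rvest::html_text(rvest::html_nodes(site, "#mw-content-text h3+ ul li"))
  data.frame(year = rep(year, length(text)),
             month = rep(month, length(text)),
             text = text,
             stringsAsFactors = FALSE)
}

# Collect one data frame per month in a preallocated list, then bind once.
# `fetch` is a parameter so the loop can be tried without network access.
scrape_years <- function(years, months = month.name, fetch = scrape_month) {
  pieces <- vector("list", length(years) * length(months))
  i <- 0
  for (y in years) {
    for (m in months) {
      i <- i + 1
      pieces[[i]] <- fetch(m, y)
    }
  }
  do.call(rbind, pieces)
}

# Usage (needs network access):
# deaths <- scrape_years(2014:2015)
```

Binding once with do.call(rbind, pieces) means the combined data frame is built a single time rather than being copied on every iteration.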

Since posting I saw that the rbind was causing an issue due to the structure of "text". I used an "as.matrix" conversion that was sort of working, but it created a data frame with 6k levels of a factor. Your code gets rid of this issue! — user137698, Jun 22 '16 at 20:52

score 0 · Answer 2 · answered Jun 22 '16 at 20:34

I wasn't able to get the same error that you got, but I think I know what you want to do.

I have a feeling this has something to do with the unequal number of deaths in each month.

I'd suggest doing it this way

library(rvest)
mlist = c("January","February","March","April","May","June","July","August",
          "September","October","November","December")

for (y in 2015:2015){
  for (m in 1:12){
    # paste0() joins the pieces without the spaces that paste() inserts by default
    site = read_html(paste0("https://en.wikipedia.org/wiki/Deaths_in_",mlist[m],
                            "_",y))
    fnames = html_nodes(site,"#mw-content-text h3+ ul li")
    text = html_text(fnames)
    # create a variable named after the month (January, February, ...)
    assign(mlist[m],text)
  }
}

This creates one character vector per month, stored in a variable named after that month (January, February, and so on).
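As a small illustration of how variables created with assign() can be collected again later, the sketch below uses mget() to gather them into a named list. The placeholder strings stand in for the scraped text, and by_month is a hypothetical name chosen for this example.

```r
mlist <- c("January", "February")

# Placeholder strings stand in for the scraped entries of each month.
for (m in seq_along(mlist)) {
  assign(mlist[m], paste("entry for", mlist[m]))
}

# mget() looks up several variables by name and returns them as a named list.
by_month <- mget(mlist)
names(by_month)
```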

An alternative (for easier use later in a loop to join them) is to use a list:

library(rvest)
data = vector("list",12)
mlist = c("January","February","March","April","May","June","July","August",
          "September","October","November","December")

for (y in 2015:2015){
  for (m in 1:12){
    # paste0() joins the pieces without the spaces that paste() inserts by default
    site = read_html(paste0("https://en.wikipedia.org/wiki/Deaths_in_",mlist[m],
                            "_",y))
    fnames = html_nodes(site,"#mw-content-text h3+ ul li")
    text = html_text(fnames)
    # store each month's character vector in its own list slot
    data[[m]] = text
  }
}

Personally, I don't like dealing with lists in R, but this seems to be the best workaround.
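Since the question only needs the age, month, and year of death, the list of monthly vectors could then be flattened into one data frame. This is a rough sketch, assuming every entry follows the "Name, age, description" pattern shown in the question's sample output. extract_age and months_to_frame are hypothetical helper names, and entries without a plain numeric age come back as NA.

```r
# Hypothetical helper: pull the first ", <digits>," field out of each entry.
extract_age <- function(entries) {
  hits <- regmatches(entries, regexec(",\\s*([0-9]{1,3}),", entries))
  vapply(hits,
         function(x) if (length(x) >= 2) as.integer(x[2]) else NA_integer_,
         integer(1), USE.NAMES = FALSE)
}

# Hypothetical helper: flatten a list of per-month character vectors
# into one data frame with year, month and age columns.
months_to_frame <- function(monthly, month_names, year) {
  counts <- lengths(monthly)
  entries <- as.character(unlist(monthly, use.names = FALSE))
  data.frame(year = rep(year, length(entries)),
             month = rep(month_names, times = counts),
             age = extract_age(entries),
             stringsAsFactors = FALSE)
}

# Usage with the list built above:
# deaths_2015 <- months_to_frame(data, mlist, 2015)
```

For several years, one such data frame per year could be built and the results combined with a single rbind at the end.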

Interesting, thank you! I will eventually want to change y to loop over multiple years, so I'll need to tweak this to be able to add more rows but this runs without errors for me. — user137698, Jun 22 '16 at 20:42
