
I am trying to scrape Wikipedia for certain astronomy-related definitions for my project. The code works pretty well, but I am not able to avoid 404s. I tried tryCatch; I think I am missing something here.

I am looking for a way to overcome 404s while running a loop. Here is my code:

library(rvest)
library(httr)
library(XML)
library(tm)


topic<-c("Neutron star", "Black hole", "sagittarius A")

for(i in topic){

  site<- paste("https://en.wikipedia.org/wiki/", i)
  site <- read_html(site)

  stats<- xmlValue(getNodeSet(htmlParse(site),"//p")[[1]]) #only the first paragraph
  #error = function(e){NA}

  stats[["topic"]] <- i

  stats<- gsub('\\[.*?\\]', '', stats)
  #stats<-stats[!duplicated(stats),]
  #out.file <- data.frame(rbind(stats,F[i]))

  output<-rbind(stats,i)

}
SKD
  • I presume you mean note the error and then skip to the next iteration of the loop? – sebastian-c Sep 22 '16 at 07:57
  • Relevant/maybe duplicate post http://stackoverflow.com/questions/8093914 – zx8754 Sep 22 '16 at 08:06
  • As a side note, have a look at http://stackoverflow.com/questions/14693956/how-can-i-prevent-rbind-from-geting-really-slow-as-dataframe-grows-larger – konvas Sep 22 '16 at 09:50
  • Use `httr::GET()` to retrieve the content of the target URL. Pass `content(res, as="text", encoding="UTF-8")` (assuming you stored the result of the `httr::GET` call in `res`) to `read_html()`. You can test `res` for its status code. Also, you're really not doing much good with `read_html` when you are passing its already-parsed content and `xml2` objects to the `XML` package functions. I'd spend some time really learning the various functions and packages you're trying to use before finishing this project. – hrbrmstr Sep 22 '16 at 11:06 (this pattern is sketched below)
  • I answered below... but I can't figure out why you have the tm package sourced, or why the first paragraph is, by default, the only data you need. But anyways... – Carl Boneri Sep 22 '16 at 12:58
  • hrbrmstr, I was trying to do error handling with httr. It did not work, so I removed that part. I tried tryCatch, and that is where I felt like I was missing something. I think I will simplify the code like you mentioned, but I needed it to be clear, since I am new to R and scraping. – SKD Sep 22 '16 at 18:39
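
For reference, here is a minimal sketch of the pattern hrbrmstr describes above: fetch each page with `httr::GET()`, test the status code, and only parse responses that return 200, skipping (or recording) the rest. The object names (`res`, `results`, `out`) are illustrative only.

    library(rvest)
    library(httr)

    topic <- c("Neutron star", "Black hole", "sagittarius A")
    results <- vector("list", length(topic))

    for (i in seq_along(topic)) {
        url <- paste0("https://en.wikipedia.org/wiki/", URLencode(topic[i]))
        res <- GET(url)
        if (status_code(res) != 200) next      # skip pages that 404 (or log them here instead)
        page <- read_html(content(res, as = "text", encoding = "UTF-8"))
        first_para <- html_text(html_nodes(page, "p"))[1]
        results[[i]] <- data.frame(topic = topic[i],
                                   summary = gsub('\\[.*?\\]', '', first_para),
                                   stringsAsFactors = FALSE)
    }

    out <- do.call(rbind, results)   # skipped topics simply drop out of the result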

1 Answer
  1. Build the URLs in the loop using sprintf.
  2. Extract all of the body text from the paragraph nodes.
  3. Drop any paragraphs that come back as empty strings (nchar of 0).
  4. I added a step to include all of the body text, each paragraph annotated with a prepended [paragraph - n] for reference... because, well, friends don't let friends waste data or make multiple HTTP requests.
  5. Build a data frame for each item in your topics list, with these columns:
       • wiki_url: should be obvious
       • topic: from the topics list
       • info_summary: the first paragraph (the one you mentioned in your post)
       • all_info: in case you need more... ya know.
  6. Bind all of the data frames in the list into one.

Note that I use an older, source version of rvest, and for ease of understanding I'm simply assigning the name html to what would be your read_html.

    library(rvest)
    library(jsonlite)

    # rvest re-exports the magrittr pipe; alias read_html to `html` for brevity
    html <- rvest::read_html

    wiki_base <- "https://en.wikipedia.org/wiki/%s"

    my_table <- lapply(sprintf(wiki_base, topic), function(i){

        raw_1 <- html_text(html_nodes(html(i), "p"))

        raw_valid <- raw_1[nchar(raw_1) > 0]

        all_info <- lapply(seq_along(raw_valid), function(j){
            sprintf(' [paragraph - %d] %s ', j, raw_valid[[j]])
        }) %>% paste0(collapse = "")

        data.frame(wiki_url = i,
                   topic = basename(i),
                   info_summary = raw_valid[[1]],
                   all_info = trimws(all_info),
                   stringsAsFactors = FALSE)

    }) %>% rbind.pages
    
       > str(my_table)
       'data.frame':    3 obs. of  4 variables:
        $ wiki_url    : chr  "https://en.wikipedia.org/wiki/Neutron star"     "https://en.wikipedia.org/wiki/Black hole" "https://en.wikipedia.org/wiki/sagittarius A"
        $ topic       : chr  "Neutron star" "Black hole" "sagittarius A"
        $ info_summary: chr  "A neutron star is the collapsed core of a large star (10–29 solar masses). Neutron stars are the smallest and densest stars kno"| __truncated__ "A black hole is a region of spacetime exhibiting such strong gravitational effects that nothing—not even particles and electrom"| __truncated__ "Sagittarius A or Sgr A is a complex radio source at the center of the Milky Way. It is located in the constellation Sagittarius"| __truncated__
        $ all_info    : chr  " [paragraph - 1] A neutron star is the collapsed core of a large star (10–29 solar masses). Neutron stars are the smallest and "| __truncated__ " [paragraph - 1] A black hole is a region of spacetime exhibiting such strong gravitational effects that nothing—not even parti"| __truncated__ " [paragraph - 1] Sagittarius A or Sgr A is a complex radio source at the center of the Milky Way. It is located in the constell"| __truncated__
    

EDIT

A function for error handling... it returns a logical, so this becomes our first step.

library(httr)  # needed for HEAD() and status_code()

url_works <- function(url){
    # TRUE if the URL responds with HTTP 200, FALSE on any other status or on error
    tryCatch(
        identical(status_code(HEAD(url)), 200L),
        error = function(e){
            FALSE
        })
}
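
A quick check, assuming httr is loaded (the second URL uses the mistyped title mentioned in the comments, as an example of a page that should not exist):

    url_works("https://en.wikipedia.org/wiki/Neutron_star")   # TRUE - the article exists
    url_works("https://en.wikipedia.org/wiki/HAT_P_32b")      # expected FALSE if no article exists under that exact title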

Based on your mention of exoplanets, here is all of the applicable data from the wiki page:

 exo_data <- (html_nodes(html('https://en.wikipedia.org/wiki/List_of_exoplanets'), '.wikitable') %>% html_table)[[2]]

str(exo_data)

    'data.frame':   2048 obs. of  16 variables:
 $ Name                          : chr  "Proxima Centauri b" "KOI-1843.03" "KOI-1843.01" "KOI-1843.02" ...
 $ bf                            : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Mass (Jupiter mass)           : num  0.004 0.0014 NA NA 0.1419 ...
 $ Radius (Jupiter radii)        : num  NA 0.054 0.114 0.071 1.012 ...
 $ Period (days)                 : num  11.186 0.177 4.195 6.356 19.224 ...
 $ Semi-major axis (AU)          : num  0.05 0.0048 0.039 0.052 0.143 0.229 0.0271 0.053 1.33 2.1 ...
 $ Ecc.                          : num  0.35 1.012 NA NA 0.0626 ...
 $ Inc. (deg)                    : num  NA 72 89.4 88.2 87.1 ...
 $ Temp. (K)                     : num  234 NA NA NA 707 ...
 $ Discovery method              : chr  "radial vel." "transit" "transit" "transit" ...
 $ Disc. Year                    : int  2016 2012 2012 2012 2010 2010 2010 2014 2009 2005 ...
 $ Distance (pc)                 : num  1.29 NA NA NA 650 ...
 $ Host star mass (solar masses) : num  0.123 0.46 0.46 0.46 1.05 1.05 1.05 0.69 1.25 0.22 ...
 $ Host star radius (solar radii): num  0.141 0.45 0.45 0.45 1.23 1.23 1.23 NA NA NA ...
 $ Host star temp. (K)           : num  3024 3584 3584 3584 5722 ...
 $ Remarks                       : chr  "Closest exoplanet to our Solar System. Within host star’s habitable zone; possibly Earth-like." "controversial" "controversial" "controversial" ...

Test our url_works function on a random sample of the table:

tests <- dplyr::sample_frac(exo_data, 0.02) %>% .$Name

Now let's build a reference table with the name, the URL to check, and a logical indicating whether the URL is valid, and in one step create a list of two data frames: one containing the URLs that don't exist, and one containing those that do. The ones that check out we can run through the function above with no issues. This way the error handling is done before we actually start trying to parse in a loop, which avoids headaches and gives a reference back to the items that need to be looked into further.

library(plyr)  # needed for ldply()

b <- ldply(sprintf('https://en.wikipedia.org/wiki/%s', tests), function(i){
    data.frame(name = basename(i), url_checked = i, url_valid = url_works(i))
}) %>% split(.$url_valid)

> str(b)
List of 2
 $ FALSE:'data.frame':  24 obs. of  3 variables:
  ..$ name       : chr [1:24] "Kepler-539c" "HD 142 A c" "WASP-44 b" "Kepler-280 b" ...
  ..$ url_checked: chr [1:24] "https://en.wikipedia.org/wiki/Kepler-539c" "https://en.wikipedia.org/wiki/HD 142 A c" "https://en.wikipedia.org/wiki/WASP-44 b" "https://en.wikipedia.org/wiki/Kepler-280 b" ...
  ..$ url_valid  : logi [1:24] FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ TRUE :'data.frame':  17 obs. of  3 variables:
  ..$ name       : chr [1:17] "HD 179079 b" "HD 47186 c" "HD 93083 b" "HD 200964 b" ...
  ..$ url_checked: chr [1:17] "https://en.wikipedia.org/wiki/HD 179079 b" "https://en.wikipedia.org/wiki/HD 47186 c" "https://en.wikipedia.org/wiki/HD 93083 b" "https://en.wikipedia.org/wiki/HD 200964 b" ...
  ..$ url_valid  : logi [1:17] TRUE TRUE TRUE TRUE TRUE TRUE ...

Obviously the second item of the list contains the data frame with valid URLs, so apply the prior function to the url column in that one (a sketch follows below). Note that I sampled the table of all planets for purposes of explanation... there are 2,400-some names, so that check will take a minute or two to run in your case. Hope that wraps it up for you.
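
For completeness, a minimal sketch of that last step, assuming the `b` list from above and reusing the same paragraph-scraping logic as earlier (the function name `scrape_wiki` is just illustrative):

    # Wrap the earlier lapply body in a reusable function
    scrape_wiki <- function(i){
        raw_1 <- html_text(html_nodes(html(i), "p"))
        raw_valid <- raw_1[nchar(raw_1) > 0]
        all_info <- lapply(seq_along(raw_valid), function(j){
            sprintf(' [paragraph - %d] %s ', j, raw_valid[[j]])
        }) %>% paste0(collapse = "")
        data.frame(wiki_url = i,
                   topic = basename(i),
                   info_summary = raw_valid[[1]],
                   all_info = trimws(all_info),
                   stringsAsFactors = FALSE)
    }

    # Parse only the URLs that passed the check; keep the failures for manual correction
    good_table   <- lapply(as.character(b$`TRUE`$url_checked), scrape_wiki) %>% rbind.pages
    needs_fixing <- b$`FALSE`$name

The `needs_fixing` vector then tells you which topic names to correct before a re-run.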

Carl Boneri
  • I was trying to clean the text with regex, so I tried the tm package. I needed the first paragraph only to test my code. I know it wouldn't take a lot of time, but I have a long list of topics. What you gave works quite well, but I do not see a step for error handling. – SKD Sep 22 '16 at 18:27
  • Sebastian, you are right! I am looking for a way to either skip the error URL or note the error in a variable, and move on to the next item. – SKD Sep 22 '16 at 18:31
  • Where would the root of the error be... that's what is unclear. Is the error that an item in the list may not have a page? Your example didn't throw errors... so I assumed it was the basis. You need to figure out where the error would come from... and as far as the regex, that's pretty vague, but no matter. – Carl Boneri Sep 22 '16 at 20:33
  • I have a list of items related to astronomy which may or may not be accurate. I know most of them are accurate, but some may not be. For example, there is an exoplanet called HAT-P-32b. I have this in the list as HAT_P_32b. I know this was a mistake, so I corrected it manually. I may not be able to go through the entire list to correct each of them, so if a request throws an error, then I would correct those entries and run them again. Please share your thoughts! By the way, I like the way you explained your steps above the solution. Thank you! – SKD Sep 22 '16 at 20:57
  • This works perfectly well! It didn't run the first time; I loaded the httr library and then it worked! Superb! I thank you for the extra bit of code you shared for the table of contents and the paragraph break-up. I like it better than my original idea. Thank you again! Cheers! – SKD Sep 23 '16 at 07:07
  • Science! Glad it helped... it's not optimised, obviously, but I have to admit splitting the data.frame into a working/non-working collection was a mid-stream idea that I think I'll be implementing in other work now, haha. So thanks back at ya! – Carl Boneri Sep 23 '16 at 10:38