- Build the variable urls in the loop using
sprintf
.
- Extract all the body text from paragraph nodes.
- Remove any vectors returning a length(0)
- I added a step to include all of the body text annotated by a prepended
[paragraph - n]
for reference..because well...friends don't let friends waste data or make multiple http requests.
- Build a data frame for each iteration in your topics list in the form of below:
Bind all of the data.frames in the list into one...
wiki_url : should be obvious
- topic: from the topics list
- info_summary: The first paragraph (you mentioned in your post)
all_info: In case you need more..ya know.
Note that I use an older, source version of rvest
for ease of understanding i'm simply assigning the name html to what would be your read_html.
library(rvest)
library(jsonlite)
html <- rvest::read_html
wiki_base <- "https://en.wikipedia.org/wiki/%s"
my_table <- lapply(sprintf(wiki_base, topic), function(i){
raw_1 <- html_text(html_nodes(html(i),"p"))
raw_valid <- raw_1[nchar(raw_1)>0]
all_info <- lapply(1:length(raw_valid), function(i){
sprintf(' [paragraph - %d] %s ', i, raw_valid[[i]])
}) %>% paste0(collapse = "")
data.frame(wiki_url = i,
topic = basename(i),
info_summary = raw_valid[[1]],
trimws(all_info),
stringsAsFactors = FALSE)
}) %>% rbind.pages
> str(my_table)
'data.frame': 3 obs. of 4 variables:
$ wiki_url : chr "https://en.wikipedia.org/wiki/Neutron star" "https://en.wikipedia.org/wiki/Black hole" "https://en.wikipedia.org/wiki/sagittarius A"
$ topic : chr "Neutron star" "Black hole" "sagittarius A"
$ info_summary: chr "A neutron star is the collapsed core of a large star (10–29 solar masses). Neutron stars are the smallest and densest stars kno"| __truncated__ "A black hole is a region of spacetime exhibiting such strong gravitational effects that nothing—not even particles and electrom"| __truncated__ "Sagittarius A or Sgr A is a complex radio source at the center of the Milky Way. It is located in the constellation Sagittarius"| __truncated__
$ all_info : chr " [paragraph - 1] A neutron star is the collapsed core of a large star (10–29 solar masses). Neutron stars are the smallest and "| __truncated__ " [paragraph - 1] A black hole is a region of spacetime exhibiting such strong gravitational effects that nothing—not even parti"| __truncated__ " [paragraph - 1] Sagittarius A or Sgr A is a complex radio source at the center of the Milky Way. It is located in the constell"| __truncated__
EDIT
A function for error handling.... returns a logical. So this becomes our first step.
url_works <- function(url){
tryCatch(
identical(status_code(HEAD(url)),200L),
error = function(e){
FALSE
})
}
Based on your use of 'exoplanet' Here is all of the applicable data from the wiki page:
exo_data <- (html_nodes(html('https://en.wikipedia.org/wiki/List_of_exoplanets'),'.wikitable')%>%html_table)[[2]]
str(exo_data)
'data.frame': 2048 obs. of 16 variables:
$ Name : chr "Proxima Centauri b" "KOI-1843.03" "KOI-1843.01" "KOI-1843.02" ...
$ bf : int 0 0 0 0 0 0 0 0 0 0 ...
$ Mass (Jupiter mass) : num 0.004 0.0014 NA NA 0.1419 ...
$ Radius (Jupiter radii) : num NA 0.054 0.114 0.071 1.012 ...
$ Period (days) : num 11.186 0.177 4.195 6.356 19.224 ...
$ Semi-major axis (AU) : num 0.05 0.0048 0.039 0.052 0.143 0.229 0.0271 0.053 1.33 2.1 ...
$ Ecc. : num 0.35 1.012 NA NA 0.0626 ...
$ Inc. (deg) : num NA 72 89.4 88.2 87.1 ...
$ Temp. (K) : num 234 NA NA NA 707 ...
$ Discovery method : chr "radial vel." "transit" "transit" "transit" ...
$ Disc. Year : int 2016 2012 2012 2012 2010 2010 2010 2014 2009 2005 ...
$ Distance (pc) : num 1.29 NA NA NA 650 ...
$ Host star mass (solar masses) : num 0.123 0.46 0.46 0.46 1.05 1.05 1.05 0.69 1.25 0.22 ...
$ Host star radius (solar radii): num 0.141 0.45 0.45 0.45 1.23 1.23 1.23 NA NA NA ...
$ Host star temp. (K) : num 3024 3584 3584 3584 5722 ...
$ Remarks : chr "Closest exoplanet to our Solar System. Within host star’s habitable zone; possibl
y Earth-like." "controversial" "controversial" "controversial" ...
test our url_works function on random sample of the table
tests <- dplyr::sample_frac(exo_data, 0.02) %>% .$Name
Now lets build a ref table with the Name, url to check, and a logical if the url is valid, and in one step create a list of two data frames, one containing the urls that don't exists....and the other that do. The ones that check out we can run through the above function with no issues. This way the error handling is done before we actually start trying to parse in a loop. Avoids headaches and gives a reference ack to what items need to be further looked into.
b <- ldply(sprintf('https://en.wikipedia.org/wiki/%s',tests), function(i){
data.frame(name = basename(i), url_checked = i,url_valid = url_works(i))
}) %>%split(.$url_valid)
> str(b)
List of 2
$ FALSE:'data.frame': 24 obs. of 3 variables:
..$ name : chr [1:24] "Kepler-539c" "HD 142 A c" "WASP-44 b" "Kepler-280 b" ...
..$ url_checked: chr [1:24] "https://en.wikipedia.org/wiki/Kepler-539c" "https://en.wikipedia.org/wiki/HD 142 A c" "https://en.wikipedia.org/wiki/WASP-44 b" "https://en.wikipedia.org/wiki/Kepler-280 b" ...
..$ url_valid : logi [1:24] FALSE FALSE FALSE FALSE FALSE FALSE ...
$ TRUE :'data.frame': 17 obs. of 3 variables:
..$ name : chr [1:17] "HD 179079 b" "HD 47186 c" "HD 93083 b" "HD 200964 b" ...
..$ url_checked: chr [1:17] "https://en.wikipedia.org/wiki/HD 179079 b" "https://en.wikipedia.org/wiki/HD 47186 c" "https://en.wikipedia.org/wiki/HD 93083 b" "https://en.wikipedia.org/wiki/HD 200964 b" ...
..$ url_valid : logi [1:17] TRUE TRUE TRUE TRUE TRUE TRUE ...
Obviously the second item of the list contains the data frame with valid urls, so apply the prior function to the url column in that one. Note that I sampled the table of all planets for purposes of explanation...There are 2400 some-odd names, so that check will take a min or two to run in your case. Hope that wraps it up for you.