
This is the data that I need:

https://www.lpi.usra.edu/meteor/metbull.php?sea=%2A&sfor=names&ants=&nwas=&falls=&valids=&stype=contains&lrec=200&map=ge&browse=&country=All&srt=name&categ=Ungrouped+achondrites&mblist=All&rect=&phot=&strewn=&snew=0&pnt=Normal%20table&dr=&page=0

I already imported the table into R:

library(tidyverse)
library(rvest)

webpage <- read_html("https://www.lpi.usra.edu/meteor/metbull.php?sea=%2A&sfor=names&ants=&nwas=&falls=&valids=&stype=contains&lrec=200&map=ge&browse=&country=All&srt=name&categ=Ungrouped+achondrites&mblist=All&rect=&phot=&strewn=&snew=0&pnt=Normal%20table&dr=&page=0")

tbls <- html_nodes(webpage, "table")
tbls_ls <- webpage %>%
  html_nodes("table") %>%
  .[5] %>%
  html_table(fill = TRUE)

data <- as_tibble(tbls_ls[[1]])

However, I need to add one more thing to the table. For some meteorites, oxygen isotope values are available: when clicking on the name of a meteorite, there is a section called "plots", and clicking on the plot redirects to a page that lists the three isotope values. What I want to do is to add three columns to my table containing the respective isotope values for each meteorite. I tried writing code for each "plots" section separately, but I feel there should be a much more elegant solution.

Luca

1 Answer


You could grab the table without isotopes, then mimic the POST request the page makes when you switch to the oxygen isotopes view, and finally left-join the two tables on the Name column. You will get more rows back than were in the left table (the one without isotopes) because there are multiple Change values per meteorite, but this matches the way you view isotopes on the site: within the plots, values appear as comma-separated lists against each isotope rather than being split out into separate rows.
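If you would rather keep the join one row per meteorite, one option (a sketch only, written against the no_isotopes and isotopes objects built in the code below) is to collapse the isotope table to one row per Name first, concatenating repeated values into comma-separated strings, much like the plots display them:

# collapse duplicate isotope rows (e.g. multiple Change entries) into
# one comma-separated row per meteorite before joining
isotopes_one_per_name <- isotopes %>%
  dplyr::group_by(Name) %>%
  dplyr::summarise(
    dplyr::across(dplyr::everything(), ~ paste(unique(.x), collapse = ", ")),
    .groups = "drop"
  )

joint_table <- dplyr::left_join(no_isotopes, isotopes_one_per_name, by = "Name")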

I go for a more selective CSS selector (#maintable) to target the specific table of interest, rather than indexing into a list of all tables.

I use write_excel_csv() to preserve the character encoding of the headers on write-out (an idea I got from @stefan).

You can remove columns you don't want in the output from joint_table before writing it out (with subset(), dplyr::select(), etc.), as sketched below.
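For example, a minimal sketch of dropping columns before the write-out (the names inside any_of() are placeholders, not real headers from the site; substitute whichever columns of the joint_table created in the code below you do not need):

# drop unwanted columns by name; the names here are placeholders
joint_table_trimmed <- joint_table %>%
  dplyr::select(-any_of(c("unwanted_column_1", "unwanted_column_2")))

write_excel_csv(x = joint_table_trimmed, file = "joint.csv", col_names = TRUE, na = "")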



library(dplyr)
library(httr)
library(rvest)
library(readr)
library(magrittr)
library(stringr)

webpage <- read_html("https://www.lpi.usra.edu/meteor/metbull.php?sea=%2A&sfor=names&ants=&nwas=&falls=&valids=&stype=contains&lrec=200&map=ge&browse=&country=All&srt=name&categ=Ungrouped+achondrites&mblist=All&rect=&phot=&strewn=&snew=0&pnt=Normal%20table&dr=&page=0")

no_isotopes <- webpage %>%
  html_node("#maintable") %>%
  html_table(fill = T) 

data <- list(
  'sfor' = "names",
  'stype' = "contains",
  'country' = "All",
  'categ' = "Ungrouped achondrites",
  'page' = "0",
  'map' = "ge",
  'srt' = "name",
  'lrec' = "200",
  'pnt' = "Oxygen isotopes",
  'mblist' = "All",
  'snew' = "0",
  'sea' = "*"
)


r <- httr::POST(url = "https://www.lpi.usra.edu/meteor/metbull.php", body = data)

isotopes <- content(r, "text") %>%
  read_html(encoding = "UTF-8") %>%
  html_node("#maintable") %>%
  html_table(fill = T)


joint_table <- dplyr::left_join(no_isotopes, isotopes, by = "Name", copy = FALSE)

write_excel_csv(x = joint_table, file = "joint.csv", col_names = T, na = "")

Example output: a screenshot of the joined table (image omitted).


Edit:

Adding in the additional information that comes from the other URLs, as per your request in the comments. I had to dynamically determine which table number to pick up, as well as handle cases where no table is present.

library(tidyverse)
#> Warning: package 'tibble' was built under R version 4.0.3
#> Warning: package 'forcats' was built under R version 4.0.3
library(httr)
#> Warning: package 'httr' was built under R version 4.0.3
library(rvest)
#> Loading required package: xml2
#> Warning: package 'xml2' was built under R version 4.0.3
#> 
#> Attaching package: 'rvest'
#> The following object is masked from 'package:purrr':
#> 
#>     pluck
#> The following object is masked from 'package:readr':
#> 
#>     guess_encoding
library(readr)
library(furrr)

get_table <- function(url) {
  page <- read_html(url)
  test_list <- page %>%
    html_nodes("#maintable tr > .inside:nth-child(odd)") %>%
    html_text() # left-hand column labels
  # the nested detail table sits in the row right after the "Data from:" row
  index <- match(TRUE, stringr::str_detect(test_list, "Data from:")) + 1
  table <- page %>%
    html_node(paste0("#maintable tr:nth-of-type(", index, ") table")) %>%
    html_table() %>%
    as_tibble()
  temp <- set_names(data.frame(t(table[, -1]), row.names = c()), t(table[, 1])) # https://www.nesono.com/node/456 ; https://stackoverflow.com/a/7970267/6241235
  return(temp)
}
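
# Optional sanity check (not part of the original answer): try get_table() on a
# single record page before mapping it over every link, e.g. the "Al Huwaysah 010"
# page referenced in the comments below:
# get_table("https://www.lpi.usra.edu/meteor/metbull.php?sea=&sfor=names&ants=&nwas=&falls=&valids=&stype=contains&lrec=200&map=ge&browse=&country=All&srt=name&categ=Ungrouped+achondrites&mblist=All&rect=&phot=&strewn=&snew=0&pnt=Normal%20table&code=55637")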


start_url <- "https://www.lpi.usra.edu/meteor/metbull.php?sea=%2A&sfor=names&ants=&nwas=&falls=&valids=&stype=contains&lrec=200&map=ge&browse=&country=All&srt=name&categ=Ungrouped+achondrites&mblist=All&rect=&phot=&strewn=&snew=0&pnt=Normal%20table&dr=&page=0"
base <- "https://www.lpi.usra.edu"
webpage <- read_html(start_url)

no_isotopes <- webpage %>%
  html_node("#maintable") %>%
  html_table(fill = T)

data <- list(
  "sfor" = "names",
  "stype" = "contains",
  "country" = "All",
  "categ" = "Ungrouped achondrites",
  "page" = "0",
  "map" = "ge",
  "srt" = "name",
  "lrec" = "200",
  "pnt" = "Oxygen isotopes",
  "mblist" = "All",
  "snew" = "0",
  "sea" = "*"
)

r <- httr::POST(url = "https://www.lpi.usra.edu/meteor/metbull.php", body = data)

isotopes <- content(r, "text") %>%
  read_html(encoding = "UTF-8") %>%
  html_node("#maintable") %>%
  html_table(fill = T)

joint_table <- dplyr::left_join(no_isotopes, isotopes, by = "Name", copy = FALSE)

lookups <- webpage %>%
  html_node("#maintable") %>%
  html_nodes("td:nth-of-type(1) a") %>%
  map_df(~ c(html_text(.), html_attr(., "href")) %>%
    set_names("Name", "Link")) %>%
  mutate(Link = paste0(base, gsub("\\s+", "%20", Link)))

error_df <- tibble(
  `State/Prov/County:` = NA_character_,
  `Origin or pseudonym:` = NA_character_,
  `Date:` = NA_character_,
  `Latitude:` = NA_character_,
  `Longitude:` = NA_character_,
  `Mass (g):` = NA_character_,
  `Pieces:` = NA_character_,
  `Class:` = NA_character_,
  `Shock stage:` = NA_character_,
  `Fayalite (mol%):` = NA_character_,
  `Ferrosilite (mol%):` = NA_character_,
  `Wollastonite (mol%):` = NA_character_,
  `Magnetic suscept.:` = NA_character_,
  `Classifier:` = NA_character_,
  `Type spec mass (g):` = NA_character_,
  `Type spec location:` = NA_character_,
  `Main mass:` = NA_character_,
  `Finder:` = NA_character_,
  `Comments:` = NA_character_
)

no_cores <- future::availableCores() - 1

future::plan(future::multisession, workers = no_cores)

df <- furrr::future_map_dfr(lookups$Link, ~ tryCatch(get_table(.x), error = function(e) error_df))

colnames(df) <- sub(":", "", colnames(df))

df2 <- df %>%
  mutate(
    `Mass (g)` = gsub(",", "", `Mass (g)`),
    across(c(`Mass (g)`, `Magnetic suscept.`), as.numeric)
  )

if (nrow(df2) == nrow(no_isotopes)) {
  additional_info <- cbind(lookups, df2)
  joint_table$Name <- gsub(" \\*\\*", "", joint_table$Name)
  final_table <- dplyr::left_join(joint_table, additional_info, by = "Name", copy = FALSE)
  write_excel_csv(x = final_table, file = "joint.csv", col_names = T, na = "")
}

Created on 2021-02-27 by the reprex package (v0.3.0)


N.B.

The OP had problems with the lookups variable for some reason, so here is an alternative I wrote that worked for them:

lookups <- map_df(
  webpage %>% html_node("#maintable") %>% html_nodes("td:nth-of-type(1) a") , ~
    data.frame(
      Name = .x %>% html_text(),
      Link =  paste0(base, gsub("\\s+", "%20", .x %>%  html_attr("href")))
    )
) %>% as_tibble()
QHarr
  • Thanks a lot for this. I was working with the data in the past weeks and noticed that including the values "Shock stage, Fayalite (mol%), Ferrosilite (mol%), Wollastonite (mol%) and Magnetic suscept.", which can be found when clicking on some meteorites, could be useful as well. Unfortunately, I am not sufficiently familiar with scraping HTML to do this. Is there a quick way to include those values in the data frame as well? – Luca Feb 26 '21 at 08:56
  • Can you provide some examples please? Urls? I don't know if that is straightforward and may need a new question. Isotopes was easy because it is part of the dropdown options on the main page. – QHarr Feb 26 '21 at 10:24
  • 1
    Of course. What I mean is that if you navigate from the link in my question to (for example) the first Meteorite "Al Huwaysah 010 **" (https://www.lpi.usra.edu/meteor/metbull.php?sea=&sfor=names&ants=&nwas=&falls=&valids=&stype=contains&lrec=200&map=ge&browse=&country=All&srt=name&categ=Ungrouped+achondrites&mblist=All&rect=&phot=&strewn=&snew=0&pnt=Normal%20table&code=55637) you have a section called "Data from: MB101 Table 0 Line 0:", where additional information for the meteorites can be found. I wanted to add those values, if available, to the corr. observations in the meteorite dataset. – Luca Feb 26 '21 at 11:41
  • Please see edit. Is this what you wanted? – QHarr Feb 27 '21 at 13:56
  • Thank you so much for your efforts!! I appreciate it a lot. The only thing is that I get error messages when running the "lookups" code: ***Error: Argument 1 must have names***, this part: `colnames(df) <- sub(":", "", colnames(df))`: ***Error in colnames<-(tmp, value = character(0)) : attempt to set 'colnames' on an object with less than two dimensions*** and the directly following code: ***Error in UseMethod("mutate_") : no applicable method for 'mutate_' applied to an object of class "function"***. Do you know if that might be a problem with my version of R? – Luca Feb 28 '21 at 10:38
  • Code works fine for me `https://filebin.net/j94lshc5o9d610x0` . I think, for some reason, df is empty for you? Did you run the reprex as is? Did you run it in an IDE? – QHarr Feb 28 '21 at 13:15
  • you can use map_dfr instead and see if df populated. Also, use the screenshot above to check the expected dimensions of each variable. – QHarr Feb 28 '21 at 13:24
  • I notice I didn't detail library(stringr) import but code ran fine. Weird. I added it into the above. My bad habit of sometimes running test code in the console. – QHarr Feb 28 '21 at 17:29
  • You're absolutely right, df is empty for me (tibble 0x0). I also tried to check map_df as you suggested, but it is not populated. I think all is traceable to the first error I got, namely "Error: Argument 1 must have names" - i.e. there seems to be a problem with the "lookups" part of the code, and I can't figure out what it is. I tried to run it on Nuvolos in a cloud-based constantly updated R environment, and it did not work either. I also imported stringr as you suggested, and it did not help. I will keep trying solutions and will let you know asap whether I find something. – Luca Mar 01 '21 at 22:13
  • Have you tested lookups by building it up from parts to see which bit fails? E.g. step 1 run `webpage %>% html_node("#maintable") ` and review; step 2 run `webpage %>% html_node("#maintable") %>% html_nodes("td:nth-of-type(1) a")` etc.... – QHarr Mar 01 '21 at 22:15
  • Yes, I did. It works exactly up to the point you wrote out in the example above. Running it with the following line produces the error "`Error in UseMethod("xml_find_first") : no applicable method for 'xml_find_first' applied to an object of class "character"`". And running the whole thing results in the error that I have already outlined above. – Luca Mar 02 '21 at 13:15
  • So adding `%>% map_df(~ c(html_text(.), html_attr(., "href"))` is where it goes wrong? – QHarr Mar 02 '21 at 15:43
  • No, that line still works! With ` set_names("Name", "Link")) ` the code throws an error. I am still trying to figure out why running it until and with `set_names("Name", "Link")) ` gives a different error than running the whole lookups part of the code. – Luca Mar 02 '21 at 17:53
  • This is more straightforward I guess `https://pastebin.com/x1udMZyK` – QHarr Mar 02 '21 at 19:10
  • 1
    Wow, now it works! Thank you so much for your help and efforts! Side note: can you recommend a comprehensive and deep course that teaches web scraping with R? I found some online but they seem rather superficial, and do not go into the detail and depth I would need. – Luca Mar 02 '21 at 22:09
  • I found the best resources to be looking at web-scraping techniques across languages here on Stackoverflow e.g. a bit crazy but I literally went through every single web-scraping question in one of the smaller tags - still took months. I then moved into looking at 4-5 other languages and web-scraping getting a feel for techniques. In addition to that, familiarize yourself with basic html + css. See some of the links in my profile page. And of course, learn TidyVerse and web-scraping packages. Look at async / parallel processing. You can see stuff here then go away and look it up. – QHarr Mar 02 '21 at 22:33
  • Just don't spend too much time on stuff if it looks dated. All the web-scraping courses I paid for were rubbish. I would have done better hiring a reputable tutor to go through trickier stuff like async. – QHarr Mar 02 '21 at 22:36
  • Thanks a lot for the valuable advice! I will begin doing that then :) – Luca Mar 02 '21 at 22:40
  • Of course don't forget to have a look at selenium and basic javascript. – QHarr Mar 02 '21 at 22:42