How to scrape text from an HTML body

Question

I've never scraped a web page before. Would it be straightforward to scrape only the text in the large gray box on the page linked below (starting with the header SRUS43 KMSR 271039 and ending with .END)? My end goal is three tidy columns of data from that text: the five-character codes, the values in inches, and the basin elevation descriptions, so any pointers on processing the text format are welcome, too.

https://www.nohrsc.noaa.gov/shef_archive/index.html?rfc=cnrfc&product=swe&year=2019&month=3&day=27&hour=6

Thank you for any help.

Possible duplicate of [Is there a simple way in R to extract only the text elements of an HTML page?](https://stackoverflow.com/questions/3195522/is-there-a-simple-way-in-r-to-extract-only-the-text-elements-of-an-html-page) — divibisan, Mar 27 '19 at 18:54

score 2 · Accepted Answer · answered Mar 27 '19 at 19:32

Reading in the text is fairly easy (see @DiceBoyT's answer). Cleaning it up into three columns is a bit more involved. The code below could use some clean-up (especially the regex), but it gets the job done:

library(tidyverse)
library(rvest)

# Download the page and pull the text of the gray box (CSS class "notes")
text <- read_html("https://www.nohrsc.noaa.gov/shef_archive/index.html?rfc=cnrfc&product=swe&year=2019&month=3&day=27&hour=6") %>% 
  html_node(".notes") %>% 
  html_text() 

# One row per line of the scraped text
df <- tibble(txt = read_lines(text))

df %>%
  mutate(
    row = row_number(),
    # Lines that start with a five-character code followed by a value
    with_code = str_extract(txt, "^[A-Za-z0-9]{5}\\s+\\d+(\\.)?\\d"),
    # Lines that carry a value but no code
    wo_code = str_extract(txt, "^:?\\s+\\d+(\\.)?\\d") %>% str_extract("[:digit:]+\\.?[:digit:]"),
    # The description is the line above each coded line, minus its first character
    basin_desc = if_else(!is.na(with_code), lag(txt, 1), NA_character_) %>% str_sub(start = 2)
  ) %>% 
  separate(with_code, c("code", "val"), sep = "\\s+") %>% 
  mutate(
    # Take the value from whichever pattern matched
    combined_val = case_when(
      !is.na(val) ~ val,
      !is.na(wo_code) ~ wo_code,
      TRUE ~ NA_character_
    ) %>% as.numeric()
  ) %>%
  filter(!is.na(combined_val)) %>%
  mutate(
    # Carry the last seen code and description forward onto code-less rows
    code = zoo::na.locf(code),
    basin_desc = zoo::na.locf(basin_desc)
  ) %>%
  select(
    code, combined_val, basin_desc
  )
#> # A tibble: 643 x 3
#>    code  combined_val basin_desc               
#>    <chr>        <dbl> <chr>                    
#>  1 ACSC1          0   San Antonio Ck - Sunol   
#>  2 ADLC1          0   Arroyo De La Laguna      
#>  3 ADOC1          0   Santa Ana R - Prado Dam  
#>  4 AHOC1          0   Arroyo Honda nr San Jose 
#>  5 AKYC1         41   SF American nr Kyburz    
#>  6 AKYC1          3.2 SF American nr Kyburz    
#>  7 AKYC1         42.2 SF American nr Kyburz    
#>  8 ALQC1          0   Alamo Canal nr Pleasanton
#>  9 ALRC1          0   Alamitos Ck - Almaden Res
#> 10 ANDC1          0   Coyote Ck - Anderson Res 
#> # ... with 633 more rows

Created on 2019-03-27 by the reprex package (v0.2.1)
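
To see what the two patterns in the pipeline pick up, it can help to run them on a few lines without touching the network. The sketch below is only an illustration: the three sample lines are invented and the real product may be laid out differently, and the character class is written as `[A-Za-z0-9]` because `[A-z]` also matches a few punctuation characters.

```r
library(stringr)

# Hypothetical sample lines, invented for illustration only;
# the actual text in the gray box may be formatted differently.
sample_txt <- c(
  ":SF American nr Kyburz",
  "AKYC1   41.0 :(inches) Entire Basin",
  ":        3.2 :(inches) Base to 5000'"
)

# Lines that begin with a five-character code followed by a value
with_code <- str_extract(sample_txt, "^[A-Za-z0-9]{5}\\s+\\d+\\.?\\d")

# Lines that carry a value but no code
wo_code <- str_extract(sample_txt, "^:?\\s+\\d+\\.?\\d")

# Take whichever pattern matched and keep only the trailing number
matched <- ifelse(is.na(with_code), wo_code, with_code)
value <- as.numeric(str_extract(matched, "\\d+\\.?\\d$"))

print(data.frame(with_code, wo_code, value))
```

The description line matches neither pattern, so it ends up as `NA` in both columns; that is what lets the pipeline filter it out after borrowing its text for `basin_desc`.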

Wow, thank you! I'm probably asking too much, but what if I wanted the `basin_desc` to be the elevation description instead of the location description, such as: `Entire Basin`, `Base to 5000'`, `5000' to Top`, etc? — dbo, Mar 27 '19 at 22:55
For the comment above, I worked through each step of @JasonAizkalns and came up successfully with: `df <- df %>% mutate(elevation_zone = gsub(".*(inches))","",txt)) ` — dbo, Mar 28 '19 at 01:42
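
A possible variant of the elevation-zone step from the comment above, with the parentheses around `inches` escaped so they are matched literally. This is a sketch under assumptions: the sample lines are hypothetical, and it presumes the zone description is whatever follows the `(inches)` marker on a line.

```r
# Hypothetical sample lines, invented for illustration only
txt <- c(
  "AKYC1   41.0 :(inches) Entire Basin",
  ":        3.2 :(inches) Base to 5000'",
  ":SF American nr Kyburz"
)

# Drop everything up to and including the literal "(inches)" marker,
# plus any whitespace that follows it
elevation_zone <- sub(".*\\(inches\\)[[:space:]]*", "", txt)

print(elevation_zone)
```

Lines without the marker come back unchanged, so they would still need to be filtered out or set to `NA` before the column is used.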

score 1 · Answer 2 · answered Mar 27 '19 at 18:28


This is pretty straightforward to scrape with rvest:

library(rvest)

text <- read_html("https://www.nohrsc.noaa.gov/shef_archive/index.html?rfc=cnrfc&product=swe&year=2019&month=3&day=27&hour=6") %>% 
  html_node(".notes") %>% 
  html_text()


dave-edison


glad to hear, though I'm getting `Error in open.connection(x, "rb") : Could not resolve host: www.nohrsc.noaa.gov Calls: %>% -> eval -> eval -> read_html -> read_html.default Execution halted` – dbo Mar 27 '19 at 18:45

Try restarting R and running the code again; it works fine for me in a fresh R session. – dave-edison Mar 27 '19 at 18:56
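
For anyone hitting the same `Could not resolve host` error: one way to keep a script from halting is to wrap the download in `tryCatch()`. This is a minimal sketch and does not fix the underlying network or DNS problem; it only reports the error and skips the parsing step when the page cannot be fetched.

```r
library(rvest)

url <- "https://www.nohrsc.noaa.gov/shef_archive/index.html?rfc=cnrfc&product=swe&year=2019&month=3&day=27&hour=6"

# Return NULL instead of stopping if the page cannot be downloaded
page <- tryCatch(
  read_html(url),
  error = function(e) {
    message("Could not fetch the page: ", conditionMessage(e))
    NULL
  }
)

# Only extract the text when the download succeeded
if (!is.null(page)) {
  text <- page %>%
    html_node(".notes") %>%
    html_text()
}
```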
