I am trying to load some data from this web page. The part of the page that I want to extract is this specific section:

(screenshot of the target section of the page)

I inspected the page and found this class and id:

(screenshot of the inspected HTML element)

So I tried like this:

library(rvest)
library(stringr)

url <- "http://www.aemet.es/es/eltiempo/prediccion/avisos?w=mna"
aa2 <- html_nodes(read_html(url),
                  'div#listado-avisos.contenedor-tabla')

aa3 <- data.frame(texto = str_replace_all(html_text(aa2), "[\r\n\t]", ""),
                  stringsAsFactors = FALSE)

And I get a data frame with a single row and no info in it... What am I doing wrong?

Thanks in advance.

Updated: possible answer thanks to QHarr:

library(httr)
library(rvest)
library(jsonlite)

url <- "https://www.aemet.es/es/eltiempo/prediccion/avisos?w=mna"
download.file(url, destfile = "scrapedpage.html", quiet = TRUE)
date_value <- read_html("scrapedpage.html") %>%
  html_node('#fecha-seleccionada-origen') %>%
  html_attr('value')

url2 <- paste0('https://www.aemet.es/es/api-eltiempo/resumen-avisos-geojson/PB/', date_value, '/D+1')
download.file(url2, destfile = "scrapedpage2.html", quiet = TRUE)

avisos <- jsonlite::parse_json(read_html("scrapedpage2.html") %>%
  html_node('p') %>%
  html_text())
GonzaloReig

1 Answer


The page is dynamically populated. If you don't mind some very minor differences, you can issue two requests: one to the initial URL to pick up a timestamp value, then an API request (as the page itself does) with that timestamp added in, so as to get predictions for the right period. Then parse the response to get at the JSON holding the avisos:

library(httr)
library(rvest)
library(jsonlite)

headers <- c('Referer' = 'https://www.aemet.es/es/eltiempo/prediccion/avisos?w=mna')

date_value <- read_html('https://www.aemet.es/es/eltiempo/prediccion/avisos?w=mna') %>%
  html_node('#fecha-seleccionada-origen') %>%
  html_attr('value')

data <- httr::GET(url = paste0('https://www.aemet.es/es/api-eltiempo/resumen-avisos-geojson/PB/', date_value, '/D+1'),
                  httr::add_headers(.headers = headers))

avisos <- jsonlite::parse_json(read_html(data$content) %>%
  html_node('p') %>%
  html_text())$objects$Avisos$geometries
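
Since the question ultimately asks for a data frame, the parsed list can be flattened along these lines. This is only a sketch: the exact fields inside each geometry depend on the GeoJSON the API returns, so inspect str(avisos[[1]]) before relying on any particular column.

# Sketch only: bind each aviso's non-list fields into one row apiece.
# Assumes every element of `avisos` exposes the same scalar fields;
# check str(avisos[[1]]) to confirm the actual structure.
avisos_df <- do.call(rbind, lapply(avisos, function(g) {
  as.data.frame(g[!sapply(g, is.list)], stringsAsFactors = FALSE)
}))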
QHarr
  • Hi QHarr. I think your approach works pretty well, but I now get an error that I didn't have before: "Error in open.connection(x, "rb") : Timeout was reached: [www.aemet.es] Connection timed out after 10001 milliseconds". I tried to avoid it using [this](https://stackoverflow.com/questions/36043172/package-rvest-for-web-scraping-https-site-with-proxy/38463559#38463559), but I get an error in the data part (I have updated the post with the code) – GonzaloReig Apr 13 '20 at 09:37
  • Hi, I will have a look later today – QHarr Apr 13 '20 at 10:13