0

I am developing a function to scrape some temperature data from a website. It works but only for the first day of the month.

This gets the data for month 8 and year 2015. However it only scrapes the first table.

How can I use rvest to collect all the tables in that month?

https://www.timeanddate.com/weather/spain/madrid/historic?month=8&year=2015

library(rvest)
library(dplyr)
library(purrr)

Temps <- function(month, year){
  url <- paste("https://www.timeanddate.com/weather/spain/madrid/historic?month=", month, "&year=",year, sep = "")
  temps_obtained <- url %>% 
    read_html() %>% 
    html_table(fill = TRUE) %>% 
    .[[2]] %>% 
    setNames(.[1,]) %>% 
    as_tibble(., .name_repair = "universal") %>% 
    dplyr::slice(., -1) %>% 
    dplyr::slice(., -n())

  return(temps_obtained)
}

map2(.x = 8, .y = 2015, ~Temps(.x, .y))

Edit: I just found this solution (for Python):

Scraping table from website [timeanddate.com]

EDIT: This is what I am currently working with which returns no data:

year = 2019
month = 11
day = 3


month = stringr::str_pad(month, width = 2, pad = 0)
day = stringr::str_pad(day, width = 2, pad = 0)
url <- paste("https://www.timeanddate.com/weather/spain/madrid/historic?hd=", year, month, day, sep = "")
temps_obtained <- url %>% 
  html_session() %>% 
  read_html() %>% 
  html_table(fill = TRUE)

EDIT:

I think this solves the problem...

  year = 2019
month = 11
day = 3

month = stringr::str_pad(month, width = 2, pad = 0)
day = stringr::str_pad(day, width = 2, pad = 0)

url <- paste("https://www.timeanddate.com/weather/spain/madrid/historic?hd=", year, month, day, sep = "")
temps_obtained <- url %>% 
  html_session() %>% 
  read_html() %>% 
  html_table(fill = TRUE) %>% 
  .[[2]] %>% 
  setNames(.[1,]) %>% 
  as_tibble(., .name_repair = "universal") %>% 
  dplyr::slice(., -1) %>% 
  dplyr::slice(., -n())

Which returns:

# A tibble: 27 x 9
   Time              ...2  Temp  Weather                 Wind   ...6  Humidity Barometer    Visibility
   <chr>             <chr> <chr> <chr>                   <chr>  <chr> <chr>    <chr>        <chr>     
 1 7:00 amSun, Nov 3 ""    55 °F Passing clouds.         16 mph ↑     88%      "29.62 \"Hg" N/A       
 2 7:30 am           ""    55 °F Passing clouds.         21 mph ↑     88%      "29.62 \"Hg" N/A       
 3 8:00 am           ""    55 °F Broken clouds.          21 mph ↑     88%      "29.62 \"Hg" N/A       
 4 8:30 am           ""    55 °F Broken clouds.          18 mph ↑     88%      "29.65 \"Hg" N/A       
 5 9:00 am           ""    55 °F Drizzle. Broken clouds. 16 mph ↑     94%      "29.68 \"Hg" N/A       
 6 9:30 am           ""    57 °F Broken clouds.          21 mph ↑     82%      "29.71 \"Hg" N/A       
 7 10:00 am          ""    57 °F Broken clouds.          26 mph ↑     63%      "29.71 \"Hg" N/A       
 8 10:30 am          ""    57 °F Scattered clouds.       29 mph ↑     55%      "29.74 \"Hg" N/A       
 9 11:00 am          ""    57 °F Scattered clouds.       17 mph ↑     55%      "29.77 \"Hg" N/A       
10 11:30 am          ""    59 °F Scattered clouds.       20 mph ↑     51%      "29.77 \"Hg" N/A 

Changing the day to a 4 gives me different results.

EDIT: Not working

The function works but for only days since 2017. If I apply the following: it does not work.

  url <- "https://www.timeanddate.com/weather/spain/madrid/historic?hd=20100109"
  temps_obtained <- url %>% 
    html_session() %>% 
    read_html() %>% 
    html_node("table") %>% 
    html_table(fill = TRUE)

Which gives me:

1                                                                                 High
2                                                                                  Low
3                                                                              Average
4 * Reported Oct 27 6:00 pm — Nov 11 6:30 pm, Madrid. Weather by CustomWeather, © 2019
                                                                           Temperature
1                                                              72 °F (Oct 31, 3:30 pm)
2                                                               39 °F (Nov 8, 8:00 am)
3                                                                                56 °F
4 * Reported Oct 27 6:00 pm — Nov 11 6:30 pm, Madrid. Weather by CustomWeather, © 2019
                                                                              Humidity
1                                                               100% (Oct 29, 7:30 am)
2                                                                 36% (Nov 8, 3:00 pm)
3                                                                                  69%
4 * Reported Oct 27 6:00 pm — Nov 11 6:30 pm, Madrid. Weather by CustomWeather, © 2019
                                                                              Pressure
1                                                          30.27 "Hg (Oct 29, 7:30 am)
2                                                           29.62 "Hg (Nov 3, 7:00 am)
3                                                                            30.00 "Hg
4 * Reported Oct 27 6:00 pm — Nov 11 6:30 pm, Madrid. Weather by CustomWeather, © 2019

Which is not the data I need.

user113156
  • 6,761
  • 5
  • 35
  • 81
  • 1
    If you hover over the links at the bottom of the page you will see an alternative form of the URL (with `?hd=20150801` etc) that will take you to each day individually, so you could rebuild your function to go through all of these. – Andrew Gustar Nov 11 '19 at 17:08
  • I didn't see this! I will give it a go, thanks! – user113156 Nov 11 '19 at 17:10
  • 1
    You might need to set it up as a `html_session` and use `jump_to`, depending on whether it lets you use those URLs directly. – Andrew Gustar Nov 11 '19 at 17:12
  • Thanks, I am almost getting it. – user113156 Nov 11 '19 at 17:20
  • I think, I got it. I just need a `map3()` function to map over the day, month and year now... – user113156 Nov 11 '19 at 17:27
  • 1
    You mean `pmap`... – Andrew Gustar Nov 11 '19 at 17:32
  • probably! I will take a look at it now. – user113156 Nov 11 '19 at 17:33
  • I have one additional question as it looks like the code does not work going back before 2017. I try the following ` url <- "https://www.timeanddate.com/weather/spain/madrid/historic?hd=20100109" temps_obtained <- url %>% html_session() %>% read_html() %>% html_node("table") %>% html_table(fill = TRUE)` I am a little stuck on the `html_node("Table")` part. I want to get the second table but I can only get the first... Any advice on how to do that? – user113156 Nov 11 '19 at 18:58
  • 1
    Leave out the `html_node("table")` - the `html_table` will produce a list of tables and you can then select the second one with `%>% .[[2]]` – Andrew Gustar Nov 11 '19 at 19:05
  • Okay thanks, when I run just the following `temps_obtained <- url %>% html_session() %>% read_html() %>% html_table(fill = TRUE)` It returns a table of no information, I go to the website https://www.timeanddate.com/weather/spain/madrid/historic?month=9&year=2009 for 2009 and I scroll down to when you suggested I take the link from and find that the link exists https://www.timeanddate.com/weather/spain/madrid/historic?hd=20090917. When I put this link into the function, it returns a table of "no information" – user113156 Nov 11 '19 at 19:11
  • Huh I actually see now... when I go to the website https://www.timeanddate.com/weather/spain/madrid/historic?hd=20090917 directly, the table does not exist. However when I go here https://www.timeanddate.com/weather/spain/madrid/historic and scroll down to `Select month` and select "September 2009". Then scroll to the bottom of this page where I see `8 Sep9 Sep10 Sep 11`, I select one of them and there I can see the table with the link telling me that its located at https://www.timeanddate.com/weather/spain/madrid/historic?hd=20090911 which it is not located at... – user113156 Nov 11 '19 at 19:14
  • 2
    Another approach - checking what the website actually asks for from the server when you change the date (using 'Network' in Chrome's Inspect) you can get json tables using a URL of the form `https://www.timeanddate.com/scripts/cityajax.php?n=spain/madrid&mode=historic&hd=20091202&month=12&year=2009&json=1`. This is what populates the daily data table on the main month page. You could go straight to these and pass them through `rjson` or `jsonlite` to convert to dataframes. – Andrew Gustar Nov 11 '19 at 19:24
  • Thanks! I am going to take a look at this approach. – user113156 Nov 11 '19 at 19:26

2 Answers2

2

If you use the network analyzer in your browser you can see what happens when you use the drop-down to select another date. You'll notice that another webpage is being loaded...

https://www.timeanddate.com/scripts/cityajax.php?n=spain/madrid&mode=historic&hd=20150802&month=8&year=2015&json=1

If you want to scrape the table you can run this for each day in August. It will return a JSON and you can use the jsonlite package to convert this to a dataframe. This will be the easiest way to scrape and have the least wasted bandwidth. But it may be a pain to get the JSON into a nice tabular format.

Adam Sampson
  • 1,971
  • 1
  • 7
  • 15
  • Note: This was mentioned by Andrew in comments first. I just didn't see his comment until after I had written this. – Adam Sampson Nov 11 '19 at 20:28
  • Doesn't matter if in comments as comments are temporary. Better to have an answer where future readers can see it. + – QHarr Nov 11 '19 at 21:32
1

Building on the previous observations, here is a procedure for getting all results for a date range into a single dataframe. For me, the JSON version returned non-standard JSON which didn't really work, but if you leave off that argument, you get back a HTML table (minus the opening and closing tags), which you can easily plug into html_table. I'm not sure the 'wind direction' (column X.1) works, but everything else seems to be there...

library(tidyverse)
library(rvest)

dates <- seq.Date(as.Date("2015-01-01"),
                  as.Date("2015-01-05"), by = "day")   #set to your range
result <- map(dates, ~read_lines(paste0("https://www.timeanddate.com/scripts/cityajax.php?n=spain/madrid&mode=historic&hd=",
                                format(., "%Y%m%d"),
                                "&month=",
                                format(., "%m"),
                                "&year=",
                                format(., "%Y")))) %>% #read files (html rather than json)
  map(~paste0("<table>", . ,"</table>")) %>%           #trick it into thinking it is a table
  map(~read_html(.) %>% html_table()) %>%              #extract table as dataframe
  map2_df(dates, ~.x[[1]] %>%                          
            set_names(make.names(.[1,], unique = TRUE)) %>% #sort out column names 
            select(-2) %>%                                  #remove blank column
            slice(2:(n() - 1)) %>%                          #remove first and last row
            mutate(Date = .y)) %>%                          #add date column
  mutate(Time = substr(Time, 1, 5))                         #remove dates from time column

head(result)
   Time  Temp Weather    Wind      X.1 Humidity Barometer Visibility       Date
1 07:00 -3 °C  Clear. No wind <U+2191>      86% 1033 mbar      16 km 2015-01-01
2 07:30 -2 °C  Clear. No wind <U+2191>      86% 1033 mbar      16 km 2015-01-01
3 08:00 -2 °C  Clear. No wind <U+2191>      86% 1033 mbar      16 km 2015-01-01
4 08:30 -3 °C  Clear. No wind <U+2191>      86% 1034 mbar      16 km 2015-01-01
5 09:00 -2 °C  Sunny. No wind <U+2191>      93% 1034 mbar      16 km 2015-01-01
6 09:30  1 °C  Sunny. No wind <U+2191>      75% 1035 mbar      16 km 2015-01-01
Andrew Gustar
  • 17,295
  • 1
  • 22
  • 32