I'm trying to scrape weather data (in R) for the 2nd of March on the following web page: https://www.timeanddate.com/weather/sweden/stockholm/historic?month=3&year=2020 I am interested in the table at the end, below "Stockholm Weather History for..."

Just above and to the right of that table is a drop-down list where I chose the 2nd of March. But when I scrape using RSelenium I only get the data for the 1st of March. How can I get the data for the 2nd (or any other date except the 1st)? I have also tried to scrape the entire page using read_html, but I can't find a way to extract the data I want from that.

The following code only seems to work for the 1st, not for any other date in the month.

library(tidyverse)  # also attaches stringr and dplyr
library(rvest)
library(RSelenium)

# start a Chrome session (chromever must match the installed Chrome version)
rD <- rsDriver(browser = "chrome", port = 4234L, chromever = "85.0.4183.83")
remDr <- rD[["client"]]
remDr$navigate("https://www.timeanddate.com/weather/sweden/stockholm/historic?month=3&year=2020")

# grab the weather-history table (class "sticky-wr") and extract its text
webElems <- remDr$findElements(using = "class name", value = "sticky-wr")
s <- as.character(webElems[[1]]$getElementText())
print(s)
hasse
  • Is this a personal one-off? If so, you could look at the free trial of their API? – QHarr Jan 05 '21 at 14:50
  • This answer is doing the same thing in Python: https://stackoverflow.com/questions/51756775/scraping-table-from-website-timeanddate-com – Ian Campbell Jan 05 '21 at 15:12

2 Answers


Here's an approach with RSelenium

library(RSelenium)
library(rvest)
driver <- rsDriver(browser = "chrome", port = 4234L, chromever = "87.0.4280.87")
client <- driver[["client"]]
client$navigate("https://www.timeanddate.com/weather/sweden/stockholm/historic?month=3&year=2020")
# click the "Mar 2" entry in the date drop-down
client$findElement(using = "link text", "Mar 2")$clickElement()
# parse the rendered page and pull out the history table (id "wt-his")
source <- client$getPageSource()[[1]]
read_html(source) %>%
   html_node(xpath = '//*[@id="wt-his"]') %>%
   html_table %>%
   head
                     Conditions Conditions      Conditions Comfort Comfort  Comfort                     
1               Time                  Temp         Weather    Wind         Humidity Barometer Visibility
2 12:20 amMon, Mar 2                 39 °F         Chilly.   7 mph       ↑      87% 29.18 "Hg        N/A
3           12:50 am                 37 °F         Chilly.   7 mph       ↑      87% 29.18 "Hg        N/A
4            1:20 am                 37 °F Passing clouds.   7 mph       ↑      87% 29.18 "Hg        N/A
5            1:50 am                 37 °F Passing clouds.   7 mph       ↑      87% 29.18 "Hg        N/A
6            2:20 am                 37 °F       Overcast.   8 mph       ↑      87% 29.18 "Hg        N/A

You can then iterate over dates with findElement().
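For example, assuming the client session from the code above is still open, a sketch of that loop (the date range, the one-second pause, and the use of purrr are my choices, not part of the original answer):

```r
library(rvest)
library(purrr)

# assumes `client` is the open RSelenium session from above
get_day_table <- function(label) {
  # click the date in the drop-down, e.g. "Mar 2"
  client$findElement(using = "link text", label)$clickElement()
  Sys.sleep(1)  # give the page a moment to refresh the table
  client$getPageSource()[[1]] %>%
    read_html() %>%
    html_node(xpath = '//*[@id="wt-his"]') %>%
    html_table()
}

# one data frame per day, named by its date label
tables <- set_names(paste("Mar", 1:5)) %>% map(get_day_table)
```

The Sys.sleep() call is a crude way to wait for the table to update; a more robust version would poll for the new content before parsing.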

You can find the XPath by right-clicking on the table and choosing Inspect in Chrome.

Then, you can find the table element, right click and choose Copy > Copy XPath.

Ian Campbell

It is always useful to use your browser's "developer tools" to inspect the web page and figure out how to extract the information you need.

A couple of tutorials that explain this, found with a quick search:

https://towardsdatascience.com/tidy-web-scraping-in-r-tutorial-and-resources-ac9f72b4fe47
https://www.scrapingbee.com/blog/web-scraping-r/

For example, on this particular webpage, when we select a new date in the drop-down list, the page sends a GET request to the server, which returns a JSON string with the data for the requested date. The page then updates the table (probably using JavaScript; I did not check this).

So, in this case, you need to emulate this behavior: capture the JSON response and parse the information in it.

In Chrome, if you look at the developer tool network pane, you will see that the address of the GET request is of the form:

https://www.timeanddate.com/scripts/cityajax.php?n=sweden/stockholm&mode=historic&hd=YYYYMMDD&month=M&year=YYYY&json=1

where YYYY stands for the 4-digit year and DD for the 2-digit day of the month; the month appears twice, as MM with two digits inside hd and as M with one digit in the month parameter. For example, for 2 March 2020 the request ends in hd=20200302&month=3&year=2020&json=1.

So you can set your code to do the GET request directly to this address, get the json response and parse it accordingly.

library(rjson)
library(rvest)
library(plyr)
library(dplyr)

year <- 2020
month <- 3
day <- 7

# create formatted url with desired dates
url <- sprintf('https://www.timeanddate.com/scripts/cityajax.php?n=sweden/stockholm&mode=historic&hd=%4d%02d%02d&month=%d&year=%4d&json=1', year, month, day, month, year)

webpage <- read_html(url) %>% html_text()

# json string is not formatted the way fromJSON function needs
# so I had to parse it manually

# split string on each row
x <- strsplit(webpage, "\\{c:")[[1]]
# remove first element (garbage)
x <- x[2:length(x)]
# clean last 2 characters in each row
x <- sapply(x, FUN=function(xx){substr(xx[1], 1, nchar(xx[1])-2)}, USE.NAMES = FALSE)

# function to get actual data in each row and put it into a dataframe

parse.row <- function(row.string) {
  # parse columns using '},{' as divider
  a <- strsplit(row.string, '\\},\\{')[[1]]
  # remove some leftover characters from parsing
  a <- gsub('\\[\\{|\\}\\]', '', a)
  # remove what I think is metadata
  a <- gsub('h:', '', gsub('s:.*,', '', a))
  
  df <- data.frame(time=a[1], temp=a[3], weather=a[4], wind=a[5], humidity=a[7],
                   barometer=a[8])
  
  return(df)
}

# use ldply to run function parse.row for each element of x and combine the results in a single dataframe
df.final <- ldply(x, parse.row)

Result:

> head(df.final)
                  time    temp           weather      wind humidity     barometer
1 "12:20 amSat, Mar 7" "28 °F" "Passing clouds." "No wind"   "100%" "29.80 \\"Hg"
2           "12:50 am" "28 °F" "Passing clouds." "No wind"   "100%" "29.80 \\"Hg"
3            "1:20 am" "28 °F" "Passing clouds."   "1 mph"   "100%" "29.80 \\"Hg"
4            "1:50 am" "30 °F" "Passing clouds."   "2 mph"   "100%" "29.80 \\"Hg"
5            "2:20 am" "30 °F" "Passing clouds."   "1 mph"   "100%" "29.80 \\"Hg"
6            "2:50 am" "30 °F"     "Low clouds." "No wind"   "100%" "29.80 \\"Hg"

I left everything as strings in the data frame, but you can convert the columns to numeric or dates if you need.
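For example, a sketch of that conversion using readr::parse_number(), which strips non-numeric text such as "°F", "%", and the stray quote characters (the column names match df.final built above; turning "No wind" into NA is my assumption about the desired behavior):

```r
library(dplyr)
library(readr)

# assumes `df.final` from the code above
df.clean <- df.final %>%
  mutate(
    temp_f   = parse_number(temp),      # e.g. "28 °F"  -> 28
    wind_mph = parse_number(wind),      # "No wind" -> NA (with a parsing warning)
    humidity = parse_number(humidity)   # e.g. "100%"   -> 100
  )
```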

kikoralston