
I want to get the historical hourly weather data from https://www.timeanddate.com/

The page is https://www.timeanddate.com/weather/usa/dayton/historic?month=2&year=2016 (here February 2016 is selected); the results appear at the bottom of the page.

I followed the steps in this answer: https://stackoverflow.com/a/47280970/9341589

It works perfectly, but only for the first day of each month. I want to parse the whole month and, if possible, the whole year.

Below is the code I am using (to parse March 1, 2016):

from urllib.request import urlopen
from bs4 import BeautifulSoup
url = "https://www.timeanddate.com/weather/usa/dayton/historic?month=3&year=2016"
page = urlopen(url)
soup = BeautifulSoup(page, "html.parser")

data = []
table = soup.find('table', attrs={'id': 'wt-his'})
for tr in table.find('tbody').find_all('tr'):
    # Named `row` rather than `dict` so the built-in dict type is not shadowed
    row = {}
    row['time'] = tr.find('th').text.strip()
    all_td = tr.find_all('td')
    # all_td[0] is skipped; the values of interest start at index 1
    row['temp'] = all_td[1].text
    row['weather'] = all_td[2].text
    row['wind'] = all_td[3].text
    arrow = all_td[4].text  # wind-direction arrow, read but not stored
    row['humidity'] = all_td[5].text
    row['barometer'] = all_td[6].text
    row['visibility'] = all_td[7].text
    data.append(row)

This is the result for March 1: RESULT

This is because the URL includes only the month and year. To change the day, for instance from Feb 1 to Feb 3, the drop-down shown in the attached picture has to be used: TAB to select day
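
Since the URL carries only `month` and `year` parameters, covering a whole year means requesting twelve pages, one per month. The following is a minimal sketch of building those URLs with the standard library; `BASE_URL` and `month_urls` are names made up for this illustration, and each page would still only show its first day until a day is selected.

```python
from urllib.parse import urlencode

# Base address taken from the question; the constant name is illustrative.
BASE_URL = "https://www.timeanddate.com/weather/usa/dayton/historic"


def month_urls(year):
    """Return the twelve monthly history URLs for one year, January first."""
    return [
        f"{BASE_URL}?{urlencode({'month': month, 'year': year})}"
        for month in range(1, 13)
    ]


urls = month_urls(2016)
print(urls[1])  # the February page used in the question
```

Each of these URLs can then be passed to whatever per-page parser is in use.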

King Julien

2 Answers


You can iterate over the table elements (tr, th, and td) for a single page:

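import contextlib
import re
import typing

import requests
from bs4 import BeautifulSoup as soup

def _remove(d: list) -> list:
    # Strip non-breaking spaces, then drop cells that end up empty
    return list(filter(None, [re.sub('\xa0', '', b) for b in d]))

# Needs the built-in `dict`; a session that rebound `dict` to a variable fails here
@contextlib.contextmanager
def get_weather_data(url: str, by_url=True) -> typing.Generator[dict, None, None]:
    # by_url=True fetches `url`; by_url=False treats `url` as HTML already in hand
    d = soup(requests.get(url).text if by_url else url, 'html.parser')
    _table = d.find('table', {'id': 'wt-his'})
    # One entry per <tr>: [list of <th> texts, *<td> texts]
    _data = [[[i.text for i in c.find_all('th')], *[i.text for i in c.find_all('td')]] for c in _table.find_all('tr')]
    # The first two rows are headers; the last row is discarded
    [h1], [h2], *data, _ = _data
    _h2 = _remove(h2)
    # zip() pairs headers and cells by position. Blank headers are dropped, so the
    # wind-direction arrow lands under 'Humidity' and later values shift one key over
    yield {tuple(_remove(h1)): [dict(zip(_h2, _remove([a, *i]))) for [[a], *i] in data]}


with get_weather_data('https://www.timeanddate.com/weather/usa/dayton/historic?month=2&year=2016') as weather:
    print(weather)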

Output:

{('Conditions', 'Comfort'): [{'Time': '12:58 amMon, Feb 1', 'Temp': '50°F', 'Weather': 'Light rain. Mostly cloudy.', 'Wind': '13 mph', 'Humidity': '↑', 'Barometer': '88%', 'Visibility': '29.79 "Hg'}, {'Time': '1:58 am', 'Temp': '46°F', 'Weather': 'Mostly cloudy.', 'Wind': '12 mph', 'Humidity': '↑', 'Barometer': '83%', 'Visibility': '29.82 "Hg'}, {'Time': '2:58 am', 'Temp': '43°F', 'Weather': 'Mostly cloudy.', 'Wind': '14 mph', 'Humidity': '↑', 'Barometer': '85%', 'Visibility': '29.87 "Hg'}, {'Time': '3:58 am', 'Temp': '42°F', 'Weather': 'Mostly cloudy.', 'Wind': '10 mph', 'Humidity': '↑', 'Barometer': '83%', 'Visibility': '29.89 "Hg'}, {'Time': '4:58 am', 'Temp': '41°F', 'Weather': 'Mostly cloudy.', 'Wind': '10 mph', 'Humidity': '↑', 'Barometer': '82%', 'Visibility': '29.91 "Hg'}, {'Time': '5:58 am', 'Temp': '39°F', 'Weather': 'Mostly cloudy.', 'Wind': '8 mph', 'Humidity': '↑', 'Barometer': '83%', 'Visibility': '29.93 "Hg'}, {'Time': '6:58 am', 'Temp': '38°F', 'Weather': 'Partly cloudy.', 'Wind': '5 mph', 'Humidity': '↑', 'Barometer': '82%', 'Visibility': '29.96 "Hg'}, {'Time': '7:58 am', 'Temp': '38°F', 'Weather': 'Partly sunny.', 'Wind': '5 mph', 'Humidity': '↑', 'Barometer': '80%', 'Visibility': '29.99 "Hg'}, {'Time': '8:58 am', 'Temp': '38°F', 'Weather': 'Overcast.', 'Wind': '5 mph', 'Humidity': '↑', 'Barometer': '78%', 'Visibility': '30.01 "Hg'}, {'Time': '9:58 am', 'Temp': '40°F', 'Weather': 'Broken clouds.', 'Wind': '7 mph', 'Humidity': '↑', 'Barometer': 'N/A', 'Visibility': '30.01 "Hg'}, {'Time': '10:58 am', 'Temp': '41°F', 'Weather': 'Broken clouds.', 'Wind': '1 mph', 'Humidity': '↑', 'Barometer': '72%', 'Visibility': '30.02 "Hg'}, {'Time': '11:58 am', 'Temp': '41°F', 'Weather': 'Partly sunny.', 'Wind': '2 mph', 'Humidity': '↑', 'Barometer': '70%', 'Visibility': '30.04 "Hg'}, {'Time': '12:58 pm', 'Temp': '42°F', 'Weather': 'Scattered clouds.', 'Wind': '2 mph', 'Humidity': '↑', 'Barometer': '69%', 'Visibility': '30.04 "Hg'}, {'Time': '1:58 pm', 'Temp': 
'43°F', 'Weather': 'Partly sunny.', 'Wind': '3 mph', 'Humidity': '↑', 'Barometer': '65%', 'Visibility': '30.03 "Hg'}, {'Time': '2:58 pm', 'Temp': '44°F', 'Weather': 'Partly sunny.', 'Wind': 'No wind', 'Humidity': '↑', 'Barometer': '62%', 'Visibility': '30.02 "Hg'}, {'Time': '3:58 pm', 'Temp': '46°F', 'Weather': 'Passing clouds.', 'Wind': '6 mph', 'Humidity': '↑', 'Barometer': '58%', 'Visibility': '30.03 "Hg'}, {'Time': '4:58 pm', 'Temp': '46°F', 'Weather': 'Sunny.', 'Wind': '6 mph', 'Humidity': '↑', 'Barometer': '57%', 'Visibility': '30.04 "Hg'}, {'Time': '5:58 pm', 'Temp': '43°F', 'Weather': 'Clear.', 'Wind': '3 mph', 'Humidity': '↑', 'Barometer': '65%', 'Visibility': '30.06 "Hg'}, {'Time': '6:58 pm', 'Temp': '39°F', 'Weather': 'Clear.', 'Wind': '1 mph', 'Humidity': '↑', 'Barometer': '71%', 'Visibility': '30.09 "Hg'}, {'Time': '7:58 pm', 'Temp': '35°F', 'Weather': 'Clear.', 'Wind': '1 mph', 'Humidity': '↑', 'Barometer': '79%', 'Visibility': '30.11 "Hg'}, {'Time': '8:58 pm', 'Temp': '32°F', 'Weather': 'Clear.', 'Wind': 'No wind', 'Humidity': '↑', 'Barometer': '85%', 'Visibility': '30.13 "Hg'}, {'Time': '9:58 pm', 'Temp': '30°F', 'Weather': 'Clear.', 'Wind': 'No wind', 'Humidity': '↑', 'Barometer': '91%', 'Visibility': '30.14 "Hg'}, {'Time': '10:58 pm', 'Temp': '28°F', 'Weather': 'Clear.', 'Wind': '5 mph', 'Humidity': '↑', 'Barometer': '93%', 'Visibility': '30.14 "Hg'}, {'Time': '11:58 pm', 'Temp': '29°F', 'Weather': 'Clear.', 'Wind': 'No wind', 'Humidity': '↑', 'Barometer': '90%', 'Visibility': '30.13 "Hg'}]}
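
In the output above, the values are shifted by one key: 'Humidity' holds the wind-direction arrow, 'Barometer' holds the humidity, and 'Visibility' holds the pressure. One way around this is to pair cells with column names by position instead of filtering out blanks. The sketch below assumes each row has nine cells (the time plus eight data cells, matching the indices used in the question); `COLUMNS`, `align_row`, and the labels "Icon" and "Wind direction" are invented for this illustration.

```python
# Column names by position. "Icon" and "Wind direction" are made-up labels
# for the two columns whose header cells are blank on the page.
COLUMNS = (
    "Time", "Icon", "Temp", "Weather", "Wind",
    "Wind direction", "Humidity", "Barometer", "Visibility",
)


def align_row(cells):
    """Pair each cell with its column by position, keeping blank cells in place."""
    cleaned = [c.replace("\xa0", " ").strip() for c in cells]
    row = dict(zip(COLUMNS, cleaned))
    row.pop("Icon", None)  # the icon cell carries no text
    return row


# First row of the output above. Its visibility value was cut off there,
# so "N/A" below is a placeholder, not real data.
cells = ["12:58 am", "", "50\xa0°F", "Light rain. Mostly cloudy.",
         "13 mph", "↑", "88%", '29.79 "Hg', "N/A"]
row = align_row(cells)
print(row["Humidity"], row["Barometer"])
```

Because nothing is filtered out before zipping, an empty or missing reading stays in its own column instead of pulling later values to the left.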

To scrape every day in the desired month, however, Selenium is needed, because the site updates the DOM dynamically via a request to the backend whenever a day is selected:

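import time
from selenium import webdriver
from selenium.webdriver.common.by import By

# Path to a local chromedriver binary; newer Selenium releases expect a Service object instead
d = webdriver.Chrome('/Path/to/chromedriver')
d.get('https://www.timeanddate.com/weather/usa/dayton/historic?month=2&year=2016')
_d = {}
# Each <option> in the 'wt-his-select' drop-down corresponds to one day of the month
for i in d.find_element(By.ID, 'wt-his-select').find_elements(By.TAG_NAME, 'option'):
    i.click()
    time.sleep(1)  # crude pause (an assumption) so the table can refresh after the click
    # Pass False so the rendered page source is parsed instead of being fetched as a URL
    with get_weather_data(d.page_source, False) as weather:
        _d[i.text] = weather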

Edit: to iterate over the full data results, use dict.items:

for a, b in _d.items():
  pass #do something with a and b
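
To save the whole month to CSV, the nested result can be flattened to one row per hourly reading. The following is a rough sketch using only the standard library; `flatten` and `to_csv` are hypothetical helper names, and `sample` is a small stand-in for `_d` with a made-up day label and a few values copied from the output shown earlier.

```python
import csv
import io


def flatten(days):
    """Yield one flat dict per hourly reading, tagged with its day label."""
    for day, tables in days.items():
        for rows in tables.values():  # keyed by the header tuple
            for row in rows:
                yield {"Day": day, **row}


def to_csv(days, out):
    """Write all readings to the file-like object `out`; return the row count."""
    rows = list(flatten(days))
    if not rows:
        return 0
    # Collect field names in first-seen order so rows with fewer keys still fit
    fields = list(dict.fromkeys(key for row in rows for key in row))
    writer = csv.DictWriter(out, fieldnames=fields)
    writer.writeheader()
    writer.writerows(rows)
    return len(rows)


# Stand-in for _d: same nesting, only two readings, hypothetical day label.
sample = {"Feb 1": {("Conditions", "Comfort"): [
    {"Time": "1:58 am", "Temp": "46°F", "Weather": "Mostly cloudy."},
    {"Time": "2:58 am", "Temp": "43°F", "Weather": "Mostly cloudy."},
]}}
buf = io.StringIO()
count = to_csv(sample, buf)
print(buf.getvalue())
```

For a real file, something like `with open('weather.csv', 'w', newline='', encoding='utf-8') as f: to_csv(_d, f)` should work in place of the `StringIO` buffer.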
Ajax1234
  • I am still getting (TypeError: 'str' object is not callable) using your code. I appreciate your help. – King Julien Aug 09 '18 at 00:32
  • After re-running the code I get: TypeError: Parameters to generic types must be types. Got {'time': '11:58 pm', 'temp': '27\xa0°F', 'weather': 'Partly cloudy.', 'wind': '21 mph', 'humidity': . – King Julien Aug 09 '18 at 00:44
  • I am trying to get all of the tables inside the tab, thanks. – King Julien Aug 09 '18 at 00:46
  • @KingJulien Strange, I do not receive that error when I run this code. I am using Python 3.7. What version are you running this code on? – Ajax1234 Aug 09 '18 at 00:49
  • My Python version is 3.6.4. The code is now running, but it still only provides the result for Feb 1. – King Julien Aug 09 '18 at 00:55
  • @KingJulien That is also what I am receiving. How does it match your desired output? – Ajax1234 Aug 09 '18 at 01:09
  • Dear @Ajax1234, I think we are very close, buddy. I am just looking for the chromedriver path. – King Julien Aug 09 '18 at 01:11
  • @Ajax1234 It would be great if I could save all of the month's values to CSV. – King Julien Aug 09 '18 at 01:13
  • @KingJulien You need to install the chromedriver from here: http://chromedriver.chromium.org/downloads and pass the path pointing to the installation to `Chrome` – Ajax1234 Aug 09 '18 at 01:15
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/177691/discussion-between-king-julien-and-ajax1234). – King Julien Aug 09 '18 at 01:25
  • @Ajax1234 This is my notebook; I attached the error messages I am getting now: https://drive.google.com/open?id=1WS72xVIEN4M1xXmpxiR6pl0oAfKwxdFs – King Julien Aug 09 '18 at 01:30
  • @Ajax1234 I have installed chromedriver now; the error I am getting occurs after the installation. – King Julien Aug 09 '18 at 01:32
  • This is what the updated script looks like; the error is still not solved: `from selenium import webdriver d = webdriver.Firefox(executable_path=r'C:/Users/KJ/Downloads/geckodriver.exe') d.get('https://www.timeanddate.com/weather/usa/dayton/historic?month=2&year=2016') for i in d.find_element_by_id('wt-his-select').find_elements_by_tag_name('option'): i.click() with get_weather_data(d.page_source) as weather: print(weather)` – King Julien Aug 09 '18 at 05:00
  • I would appreciate it if you modified the answer to include the iteration you mentioned. Thank you for your time. – King Julien Aug 09 '18 at 18:52

Using the developer tools in Chrome, it looks like you can find and click a link whose text is the first three letters of the month followed by the day (for example, "Feb 3"), using driver.find_element_by_link_text(date_here).click()
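
To click through a whole month this way, the link texts have to be generated first. A small sketch, assuming the labels really are a three-letter English month abbreviation, a space, and the day number with no zero padding (none of which was verified against the page); `MONTH_ABBR` and `day_link_texts` are names invented here.

```python
import calendar

# Fixed English abbreviations so the result does not depend on the locale.
MONTH_ABBR = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
              "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


def day_link_texts(year, month):
    """Return labels such as 'Feb 1' for every day of the given month."""
    days_in_month = calendar.monthrange(year, month)[1]
    return [f"{MONTH_ABBR[month - 1]} {day}" for day in range(1, days_in_month + 1)]


labels = day_link_texts(2016, 2)
print(labels[0], labels[-1], len(labels))
```

Each label could then be passed as `date_here` in the call above, one click per day.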

Mason Caiby