
I'm not very experienced in the world of scraping data, so the problem here may be obvious to some.

What I want is to scrape historical daily weather data from wunderground.com without paying for the API. Maybe it's not possible at all.

My method is simply to use requests.get and save the whole text into a file (code below).

Instead of getting the tables that can be accessed from the web browser (see image below), the result is a file that has almost everything but those tables. Something like this:

Summary
No data recorded
Daily Observations
No Data Recorded

What is weird is that if I save-as the web page with Firefox, the result depends on whether I choose 'web-page, only HTML' or 'web-page, complete': the latter includes the data I'm interested in, the former does not.

Is it possible that this is on purpose so nobody scrapes their data? I just wanted to make sure there is no workaround for this problem.

Thanks in advance, Juan

Note: I tried setting the User-Agent header, to no avail.

# Note: I run > set PYTHONIOENCODING=utf-8 before executing python
import requests

# URL with wunderground weather information for a specific date:
date = '2019-03-12'
url = 'https://www.wunderground.com/history/daily/sd/khartoum/HSSS/date/' + date
r = requests.get(url)

# Write a file to check if the tables are being retrieved:
with open('test.html', 'wb') as testfile:
    testfile.write(r.text.encode('utf-8'))
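
For reference, this is roughly how I tried the User-Agent header mentioned above (a minimal sketch; the header string is just an example):

import requests

date = '2019-03-12'
url = 'https://www.wunderground.com/history/daily/sd/khartoum/HSSS/date/' + date

# Example User-Agent string; any desktop browser string could go here.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
r = requests.get(url, headers=headers)

# The tables are still missing, since they are filled in by JavaScript after page load.
with open('test_useragent.html', 'wb') as testfile:
    testfile.write(r.text.encode('utf-8'))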

Screenshot of the tables I want to scrape.


UPDATE: FOUND A SOLUTION

Thanks for pointing me to the selenium module; it is exactly the solution I needed. The code extracts all the tables present at the URL for a given date (as seen when visiting the site normally). It still needs modifications to scrape over a list of dates and to organize the CSV files it creates (a sketch of such a date loop follows the code below).

Note: geckodriver.exe is needed in the working directory.

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary
import re

# URL with wunderground weather information
url = 'https://www.wunderground.com/history/daily/sd/khartoum/HSSS/date/2019-3-12'

# Point Selenium at the local Firefox binary and start the browser
# (geckodriver.exe must be on the PATH or in the working directory):
bi = FirefoxBinary(r'C:\Program Files (x86)\Mozilla Firefox\firefox.exe')
br = webdriver.Firefox(firefox_binary=bi)

# This starts an instance of Firefox at the specified URL; the browser
# executes the JavaScript that fills in the tables:
br.get(url)

# At this point the rendered HTML is available and can be
# extracted with BeautifulSoup:
sopa = BeautifulSoup(br.page_source, 'lxml')

# Close the Firefox instance started before:
br.quit()

# I'm only interested in the tables contained on the page:
tablas = sopa.find_all('table')

# Write each table into its own CSV file:
for i, tabla in enumerate(tablas, start=1):
    with open('wunderground' + str(i) + '.csv', 'w', encoding='utf-8') as out_file:

        # ---- Write the table header: ----
        output_head = [head.text.strip() for head in tabla.find_all('th')]

        # Some cleaning and formatting of the text before writing:
        encabezado = '"' + '";"'.join(output_head) + '"'
        encabezado = re.sub(r'\s', '', encabezado) + '\n'
        out_file.write(encabezado)

        # ---- Write the rows (skipping the header row): ----
        filas = tabla.find_all('tr')
        for table_row in filas[1:]:
            output_row = [column.text.strip() for column in table_row.find_all('td')]

            # Some cleaning and formatting of the text before writing:
            fila = '"' + '";"'.join(output_row) + '"'
            fila = re.sub(r'\s', '', fila) + '\n'
            out_file.write(fila)
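
For the date loop mentioned above, here is a minimal sketch; it assumes the scraping code above has been wrapped into a hypothetical function scrape_tables(date_str) that takes a date string and writes the CSV files for that day:

from datetime import date, timedelta

start = date(2019, 3, 1)
end = date(2019, 3, 12)

day = start
while day <= end:
    # The updated URL above uses non-zero-padded month/day, e.g. 2019-3-1
    date_str = '{}-{}-{}'.format(day.year, day.month, day.day)
    scrape_tables(date_str)  # hypothetical wrapper around the code above
    day += timedelta(days=1)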

Extra: the answer from @QHarr works beautifully, but I needed a couple of modifications to use it because I use Firefox on my PC. Note that for this to work I had to add geckodriver.exe to my working directory. Here's the code:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd

url = 'https://www.wunderground.com/history/daily/sd/khartoum/HSSS/date/2019-03-12'
bi = FirefoxBinary(r'C:\Program Files (x86)\Mozilla Firefox\firefox.exe')
driver = webdriver.Firefox(firefox_binary=bi)
# driver = webdriver.Chrome()
driver.get(url)
tables = WebDriverWait(driver,20).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "table")))
for table in tables:
    newTable = pd.read_html(table.get_attribute('outerHTML'))
    if newTable:
        print(newTable[0].fillna(''))
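
To also get CSV files out of this version, the same loop can write each table to disk instead of printing it. A minimal sketch, continuing from the snippet above (the file-name pattern is just an example):

# Sketch: write each table found by the code above to its own CSV file.
# Assumes `tables` and `driver` from the snippet above are still in scope.
for i, table in enumerate(tables, start=1):
    newTable = pd.read_html(table.get_attribute('outerHTML'))
    if newTable:
        newTable[0].fillna('').to_csv('wunderground{}.csv'.format(i), index=False)

driver.quit()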
Juan
  • This is a common scraping issue - most modern webpages are heavily reliant on javascript, which requires a VM to execute inside of. When you use requests, or `curl`, all you get is the raw html, without any of the functionality that the javascript provides. A good workaround for scraping is to use the selenium library, which gives you that javascript VM. It's a steep learning curve, but well worth it. – Danielle M. Mar 22 '19 at 19:19
  • Why don't you want to use the API? Actually, scraping via the API is much easier and more reliable. – omegastripes Mar 22 '19 at 19:27
  • @omegastripes : money – Juan Mar 22 '19 at 19:30
  • If you really want to scrape a webpage with JavaScript, you might want to use Selenium or something that can run an actual headless browser. – Random Davis Mar 22 '19 at 19:32
  • @Juan You can examine the webpage's logged requests in a browser's developer tools and do a bit of reverse engineering to find out how to use the API for free. – omegastripes Mar 22 '19 at 20:24
  • @omegastripes that sounds like some 1337 h4x0r level stuff. I'd love to understand what you said but I don't quite get it. Thanks anyway! – Juan Mar 25 '19 at 19:02
  • @Juan That is quite simple stuff, take a look at [this](https://yadi.sk/i/WtigmOYwnnER_w) – omegastripes Mar 25 '19 at 21:23
  • Very cool @omegastripes! I'll give it a try one of these days. Thank you! – Juan Mar 26 '19 at 16:13
  • How do you bypass the `ValueError: No tables found matching pattern '.+'` error that comes up from @Qharr? – Rivers31334 Dec 20 '19 at 14:36

4 Answers


You could use selenium to ensure the page has loaded, then pandas read_html to get the tables:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd

url = 'https://www.wunderground.com/history/daily/sd/khartoum/HSSS/date/2019-03-12'
driver = webdriver.Chrome()
driver.get(url)
tables = WebDriverWait(driver,20).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "table")))
for table in tables:
    newTable = pd.read_html(table.get_attribute('outerHTML'))
    if newTable:
        print(newTable[0].fillna(''))
QHarr

They have added some additional tables at the top, so just searching for `table` will not work now. I used the class selector with the full class name to fetch the records, and it works fine:

tables = WebDriverWait(driver,20).until(EC.presence_of_all_elements_located((By.CLASS_NAME, "mat-table.cdk-table.mat-sort.ng-star-inserted")))
Jeremy Caney

Another direction: use the API calls that the website itself is making.

(The HTTP call was taken from Chrome developer tools)

Example:

HTTP GET https://api-ak.wunderground.com/api/d8585d80376a429e/history_20180812/lang:EN/units:english/bestfct:1/v:2.0/q/HSSS.json?showObs=0&ttl=120

Response

{
    "response": {
        "version": "2.0",
        "units": "english",
        "termsofService": "https://www.wunderground.com/weather/api/d/terms.html",
        "attribution": {
        "image":"//icons.wxug.com/graphics/wu2/logo_130x80.png",
        "title":"Weather Underground",
        "link":"http://www.wunderground.com"
        },
        "features": {
        "history": 1
        }
        , "location": {
        "name": "Khartoum",
        "neighborhood":null,
        "city": "Khartoum",
        "state": null,
        "state_name":"Sudan",
        "country": "SD",
        "country_iso3166":"SA",
        "country_name":"Saudi Arabia",
        "continent":"AS",
        "zip":"00000",
        "magic":"474",
        "wmo":"62721",
        "radarcode":"xxx",
        "radarregion_ic":null,
        "radarregion_link": "//",
        "latitude":15.60000038,
        "longitude":32.54999924,
        "elevation":null,
        "wfo": null,
        "l": "/q/zmw:00000.474.62721",
        "canonical": "/weather/sa/khartoum"
        },
        "date": {
    "epoch": 1553287561,
    "pretty": "11:46 PM EAT on March 22, 2019",
    "rfc822": "Fri, 22 Mar 2019 23:46:01 +0300",
    "iso8601": "2019-03-22T23:46:01+0300",
    "year": 2019,
    "month": 3,
    "day": 22,
    "yday": 80,
    "hour": 23,
    "min": "46",
    "sec": 1,
    "monthname": "March",
    "monthname_short": "Mar",
    "weekday": "Friday",
    "weekday_short": "Fri",
    "ampm": "PM",
    "tz_short": "EAT",
    "tz_long": "Africa/Khartoum",
    "tz_offset_text": "+0300",
    "tz_offset_hours": 3.00
}
    }
        ,
"history": {
    "start_date": {
    "epoch": 1534064400,
    "pretty": "12:00 PM EAT on August 12, 2018",
    "rfc822": "Sun, 12 Aug 2018 12:00:00 +0300",
    "iso8601": "2018-08-12T12:00:00+0300",
    "year": 2018,
    "month": 8,
    "day": 12,
    "yday": 223,
    "hour": 12,
    "min": "00",
    "sec": 0,
    "monthname": "August",
    "monthname_short": "Aug",
    "weekday": "Sunday",
    "weekday_short": "Sun",
    "ampm": "PM",
    "tz_short": "EAT",
    "tz_long": "Africa/Khartoum",
    "tz_offset_text": "+0300",
    "tz_offset_hours": 3.00
},
    "end_date": {
    "epoch": null,
    "pretty": null,
    "rfc822": null,
    "iso8601": null,
    "year": null,
    "month": null,
    "day": null,
    "yday": null,
    "hour": null,
    "min": null,
    "sec": null,
    "monthname": null,
    "monthname_short": null,
    "weekday": null,
    "weekday_short": null,
    "ampm": null,
    "tz_short": null,
    "tz_long": null,
    "tz_offset_text": null,
    "tz_offset_hours": null
},
    "days": [
        {
        "summary": {
        "date": {
    "epoch": 1534021200,
    "pretty": "12:00 AM EAT on August 12, 2018",
    "rfc822": "Sun, 12 Aug 2018 00:00:00 +0300",
    "iso8601": "2018-08-12T00:00:00+0300",
    "year": 2018,
    "month": 8,
    "day": 12,
    "yday": 223,
    "hour": 0,
    "min": "00",
    "sec": 0,
    "monthname": "August",
    "monthname_short": "Aug",
    "weekday": "Sunday",
    "weekday_short": "Sun",
    "ampm": "AM",
    "tz_short": "EAT",
    "tz_long": "Africa/Khartoum",
    "tz_offset_text": "+0300",
    "tz_offset_hours": 3.00
},
        "temperature": 82,
    "dewpoint": 66,
    "pressure": 29.94,
    "wind_speed": 11,
    "wind_dir": "SSE",
    "wind_dir_degrees": 166,
    "visibility": 5.9,
    "humidity": 57,
    "max_temperature": 89,
    "min_temperature": 75,
    "temperature_normal": null,
    "min_temperature_normal": null,
    "max_temperature_normal": null,
    "min_temperature_record": null,
    "max_temperature_record": null,
    "min_temperature_record_year": null,
    "max_temperature_record_year": null,
    "max_humidity": 83,
    "min_humidity": 40,
    "max_dewpoint": 70,
    "min_dewpoint": 63,
    "max_pressure": 29.98,
    "min_pressure": 29.89,
    "max_wind_speed": 22,
    "min_wind_speed": 5,
    "max_visibility": 6.2,
    "min_visibility": 1.9,
    "fog": 0,
    "hail": 0,
    "snow": 0,
    "rain": 1,
    "thunder": 0,
    "tornado": 0,
    "snowfall": null,
    "monthtodatesnowfall": null,
    "since1julsnowfall": null,
    "snowdepth": null,
    "precip": 0.00,
    "preciprecord": null,
    "preciprecordyear": null,
    "precipnormal": null,
    "since1janprecipitation": null,
    "since1janprecipitationnormal": null,
    "monthtodateprecipitation": null,
    "monthtodateprecipitationnormal": null,
    "precipsource": "3Or6HourObs",
    "gdegreedays": 32,
    "heatingdegreedays": 0,
    "coolingdegreedays": 17,
    "heatingdegreedaysnormal": null,
    "monthtodateheatingdegreedays": null,
    "monthtodateheatingdegreedaysnormal": null,
    "since1sepheatingdegreedays": null,
    "since1sepheatingdegreedaysnormal": null,
    "since1julheatingdegreedays": null,
    "since1julheatingdegreedaysnormal": null,
    "coolingdegreedaysnormal": null,
    "monthtodatecoolingdegreedays": null,
    "monthtodatecoolingdegreedaysnormal": null,
    "since1sepcoolingdegreedays": null,
    "since1sepcoolingdegreedaysnormal": null,
    "since1jancoolingdegreedays": null,
    "since1jancoolingdegreedaysnormal": null
,
        "avgoktas": 5,
        "icon": "rain"
        }
        }
    ]
}
}
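
A minimal sketch of calling this endpoint from Python and reading the daily summary; it assumes the URL and API key shown above still work and that the response has the structure shown:

import requests

# Endpoint copied from the request above; the key and its validity are assumptions.
url = ('https://api-ak.wunderground.com/api/d8585d80376a429e/history_20180812/'
       'lang:EN/units:english/bestfct:1/v:2.0/q/HSSS.json')
params = {'showObs': 0, 'ttl': 120}

r = requests.get(url, params=params)
r.raise_for_status()
data = r.json()

# Navigate the structure shown in the sample response above:
for day in data['history']['days']:
    summary = day['summary']
    print(summary['date']['pretty'], summary['max_temperature'], summary['min_temperature'])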
balderman
  • Hi, thanks for this, but unfortunately I cannot understand this with my current knowledge. How would I run those commands? Is it the Windows cmd, a Unix command line, something else? – Juan Mar 25 '19 at 18:57

I do it in the following manner.

I open the developer tools using Ctrl+Shift+I, then submit a request via the website while recording the network transactions (in this case, you just click on the View button). Then I filter those for XHR.

I then go over the response of each remaining request. When a response looks like the one I want, I take its request URL and use it. It might be best to copy the response to a separate JSON file and beautify it so it's easy to read and to determine whether it's what you want.

In my scenario, my request URL was a get request to the following https://api.weather.com/v1/location/OLBA:9:LB/observations/historical.json?apiKey=_____________&units=e&startDate=20200305

I removed the API key from the URL above.

If you paste the URL into a browser you should get the same response, and then you can use the Python requests package to get the response and just parse the JSON.
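
A minimal sketch of that last step; the location code, start date, and API key below are placeholders you would take from your own recorded request:

import requests

# Placeholder values; copy them from the request you recorded in the dev tools.
API_KEY = 'YOUR_RECORDED_API_KEY'
url = 'https://api.weather.com/v1/location/OLBA:9:LB/observations/historical.json'
params = {'apiKey': API_KEY, 'units': 'e', 'startDate': '20200305'}

r = requests.get(url, params=params)
r.raise_for_status()
data = r.json()

# The exact structure depends on the endpoint; this one returns a list of observations.
for obs in data.get('observations', []):
    print(obs)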

Ebrahim Karam