
I would like to scrape daily COVID-19 data from the Washington State Department of Health dashboard (https://www.doh.wa.gov/Emergencies/NovelCoronavirusOutbreak2020COVID19/DataDashboard) using Python.

The site has an embedded Power BI dashboard. Some simple inspection reveals that the site requests a specific view from a Power BI site (https://app.powerbigov.us/view?...). This view argument changes daily as the dashboard data is updated. I had been using a simple requests.get call to query this address, but I cannot capture the changing view argument from the Department of Health site with that package alone, because the page is rendered with JavaScript. I have tried the following Selenium code (Ubuntu, Chromium), but despite my efforts to wait until the relevant iframe is rendered, I get a timeout message:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

target_url = 'http://www.doh.wa.gov/Emergencies/NovelCoronavirusOutbreak2020COVID19/DataDashboard'

# Headless Chromium with a desktop Chrome user agent
chrome_options = Options()
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36')
chrome_options.add_argument('--headless')
chrome_options.add_argument('--remote-debugging-port=9222')

driver = webdriver.Chrome(options=chrome_options)

driver.get(target_url)

# Wait up to 20 seconds for the dashboard iframe, then switch into it
WebDriverWait(driver, 20).until(EC.frame_to_be_available_and_switch_to_it((By.ID, "CovidDashboardFrame")))

TimeoutException: Message: timeout: Timed out receiving message from renderer: 300.000 (Session info: headless chrome=83.0.4103.61)

Without the frame switching, a blank page is returned. I have tested my setup with another site (www.google.com) and am able to retrieve its source code, so the problem seems specific to this particular site.
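
To make the goal concrete: once the rendered HTML is available (for example from Selenium's driver.page_source), the changing view argument could be read from the iframe's src attribute along these lines. This is only a minimal sketch using the standard library. The names IframeSrcFinder, find_iframe_src and query_args, the sample HTML, and the example.invalid URL with its token parameter are all illustrative assumptions, and it assumes the Power BI address appears in the src of the iframe with id CovidDashboardFrame.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse, parse_qs


class IframeSrcFinder(HTMLParser):
    """Remember the src of the first iframe whose id matches frame_id."""

    def __init__(self, frame_id):
        super().__init__()
        self.frame_id = frame_id
        self.src = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased
        attr_map = dict(attrs)
        if tag == "iframe" and attr_map.get("id") == self.frame_id and self.src is None:
            self.src = attr_map.get("src")


def find_iframe_src(html, frame_id="CovidDashboardFrame"):
    """Return the iframe's src attribute, or None if no such iframe exists."""
    finder = IframeSrcFinder(frame_id)
    finder.feed(html)
    return finder.src


def query_args(url):
    """Return the query-string arguments of url, keeping the first value of each."""
    return {key: values[0] for key, values in parse_qs(urlparse(url).query).items()}


if __name__ == "__main__":
    # Made-up HTML and URL, standing in for the rendered page source
    sample_html = (
        '<div><iframe id="CovidDashboardFrame" '
        'src="https://example.invalid/view?token=PLACEHOLDER"></iframe></div>'
    )
    src = find_iframe_src(sample_html)
    print(src)
    print(query_args(src))
```

The parsing step is kept separate from the browser automation so that it can be checked against saved HTML without launching Chromium.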

Thank you very much for your help.

eat3donuts
  • chrome driver with Mozilla as user agent - I like it)))) – Igor Dragushhak Jun 23 '20 at 23:20
  • First run without `--headless` to see what you get in the browser, and use `DevTools` to inspect the HTML. – furas Jun 24 '20 at 00:17
  • BTW: Johns Hopkins University keeps coronavirus data on GitHub as CSV files and this doesn't need Selenium to read it: https://github.com/CSSEGISandData/COVID-19 – furas Jun 24 '20 at 00:20
  • I don't know what data you need from Power BI, but there is also a page with `Summary Data Tables` and an Excel file: https://www.doh.wa.gov/emergencies/coronavirus – furas Jun 24 '20 at 00:30
  • @eat3donuts Code block to _scrape daily COVID-19 Data from the Washington State Department of Health Dashboard_? – undetected Selenium Jun 24 '20 at 08:24
  • Thanks for the suggestions @furas - the other sources do not have all the data I need, and the headless option seems to be required for my chromium driver to function on any site. – eat3donuts Jun 24 '20 at 16:33
  • Use `--headless` once all the code works correctly. While you are still writing the code, run without `--headless` to see what the browser gets from the server, and use `DevTools` in the browser to find an `XPath` or `CSS Selector` that you can use in the code. This way you can check whether the code is correct. – furas Jun 24 '20 at 19:18
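
One way to follow the suggestion above of running without `--headless` while debugging is to make the flag switchable rather than hard-coded. The following is a rough sketch, not a tested fix for the timeout: chrome_arguments and make_driver are hypothetical helper names, and it assumes a setup where Chromium can also start with a visible window, which the asker reports is not the case on their machine.

```python
def chrome_arguments(headless):
    """Return the Chrome command-line switches, adding --headless only on request."""
    args = ["--no-sandbox"]
    if headless:
        args.append("--headless")
    return args


def make_driver(headless):
    """Create a Chrome driver, either headless or with a visible window."""
    # Imported here so chrome_arguments() can be used without Selenium installed
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    for arg in chrome_arguments(headless):
        options.add_argument(arg)
    return webdriver.Chrome(options=options)
```

Calling make_driver(headless=False) during development opens a visible browser for inspection with DevTools, and switching to make_driver(headless=True) restores the headless behaviour once the selectors are known to work.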
