1

I will be more than happy to find out this question is a duplicate, but if so - I can't find that Q&A.

There is this mysterious page from the New York State Department of Health containing "Fatalities by County and Age Group". As the title implies, it contains two tables ("By County"/"By Age Group").

For some strange reason, the data on this page is super-secured. It can't be selected, the page can't be saved and it can't be printed. The data isn't on the page source. I also tried (and failed) to inspect xhr calls for the data.

Obviously, requests and beautifulsoup can't handle it. I tried the usual Selenium incantations (so, unless I'm told otherwise, I won't clutter this question with "what I tried" snippets).

Desire output: the data from those two tables, in any conceivable format.

The only thing I can think of is to take a screenshot and try to ocr the image...

I don't know if it's Selenium, Tableau, the NYS Dep't of Health or just me, but it's time to call in the heavy artillery...

Jack Fleeting
  • 24,385
  • 6
  • 23
  • 45
  • Try [this](https://filebin.net/mz3mtd7e8cejqkrx) to send a post http requests in order to fetch json content. – asmitu Apr 06 '20 at 15:06
  • 1
    @asmitu that comment is out of knowledge at all. Jack is not new for scraping field. ! if you can call it. do it better than redirecting to other resource – αԋɱҽԃ αмєяιcαη Apr 06 '20 at 15:17

2 Answers2

3

Let me explain for you the scenario:

  1. Website is generating a session id behind that parameter X-Session-Id which is dynamically generated once you visit the main page page index. So i called it via GET request and I've picked it up from the headers response.
  2. I've figured out an POST request which is automatically generated before you hit your desired url which is actually using the session id which we collected before. here is it https://covid19tracker.health.ny.gov/vizql/w/NYS-COVID19-Tracker/v/NYSDOHCOVID-19Tracker-Fatalities/clear/sessions/{session id}

  3. Now we can call your target which is https://covid19tracker.health.ny.gov/views/NYS-COVID19-Tracker/NYSDOHCOVID-19Tracker-Fatalities?%3Aembed=yes&%3Atoolbar=no&%3Atabs=n.

  4. Now I noticed another XHR request to the back-end API. But before we do the call, We will parse the HTML content for picking up the time object which is responsible on generating the data freshly from the API so we will get an instant data (consider it like a live chat actually). in our case it's behind lastUpdatedAt inside the HTML

  5. I noticed as well that we will need to pickup the recent X-Session-Id generated from our previous POST request.

  6. Now we will make the call using our picked up session to https://covid19tracker.health.ny.gov/vizql/w/NYS-COVID19-Tracker/v/NYSDOHCOVID-19Tracker-Fatalities/bootstrapSession/sessions/{session}

Now we have received the full response. you can parse it or do whatever you want.

import requests
import re


data = {
    'worksheetPortSize': '{"w":1536,"h":1250}',
    'dashboardPortSize': '{"w":1536,"h":1250}',
    'clientDimension': '{"w":1536,"h":349}',
    'renderMapsClientSide': 'true',
    'isBrowserRendering': 'true',
    'browserRenderingThreshold': '100',
    'formatDataValueLocally': 'false',
    'clientNum': '',
    'navType': 'Reload',
    'navSrc': 'Top',
    'devicePixelRatio': '2.5',
    'clientRenderPixelLimit': '25000000',
    'allowAutogenWorksheetPhoneLayouts': 'true',
    'sheet_id': 'NYSDOH%20COVID-19%20Tracker%20-%20Fatalities',
    'showParams': '{"checkpoint":false,"refresh":false,"refreshUnmodified":false}',
    'filterTileSize': '200',
    'locale': 'en_US',
    'language': 'en',
    'verboseMode': 'false',
    ':session_feature_flags': '{}',
    'keychain_version': '1'
}


def main(url):
    with requests.Session() as req:
        r = req.post(url)
        sid = r.headers.get("X-Session-Id")

        r = req.post(
            f"https://covid19tracker.health.ny.gov/vizql/w/NYS-COVID19-Tracker/v/NYSDOHCOVID-19Tracker-Fatalities/clear/sessions/{sid}")

        r = req.get(
            "https://covid19tracker.health.ny.gov/views/NYS-COVID19-Tracker/NYSDOHCOVID-19Tracker-Fatalities?%3Aembed=yes&%3Atoolbar=no&%3Atabs=n")

        match = re.search(r"lastUpdatedAt.+?(\d+),", r.text).group(1)

        time = '{"featureFlags":"{\"MetricsAuthoringBeta\":false}","isAuthoring":false,"isOfflineMode":false,"lastUpdatedAt":xxx,"workbookId":9}'.replace(
            'xxx', f"{match}")

        data['stickySessionKey'] = time
        nid = r.headers.get("X-Session-Id")

        r = req.post(
            f"https://covid19tracker.health.ny.gov/vizql/w/NYS-COVID19-Tracker/v/NYSDOHCOVID-19Tracker-Fatalities/bootstrapSession/sessions/{nid}", data=data)

        print(r.text)


main("https://covid19tracker.health.ny.gov")
0

I've made a tableau scraper library to extract the data from Tableau worksheets

You can get all data in pandas dataframe for each worksheet with the following code:

from tableauscraper import TableauScraper as TS

url = "https://covid19tracker.health.ny.gov/views/NYS-COVID19-Tracker/NYSDOHCOVID-19Tracker-Fatalities"

ts = TS()
ts.loads(url)
dashboard = ts.getWorkbook()

for t in dashboard.worksheets:
    # show worksheet name
    print(f"WORKSHEET NAME : {t.name}")
    # show dataframe for this worksheet
    print(t.data)

Try this on repl.it

Bertrand Martel
  • 42,756
  • 16
  • 135
  • 159