I'm doing an exercise in scraping data from a website, using ZocDoc as an example. I'm trying to get a list of all insurance providers and their plans (you can access this information on their homepage in the insurance dropdown).

It appears that all the data is loaded via a <script> tag when the page loads. In the network tab there don't appear to be any network calls that return JSON including the plan names. I am able to get the insurance providers with the following (it's messy, but it works).

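  import json
  import requests
  from bs4 import BeautifulSoup as bs

  resp = requests.get('https://zocdoc.com')
  soup = bs(resp.text, 'html.parser')
  # The 18th <script> tag holds the insurance data; this hard-coded index is brittle
  data = str(soup.findAll('script')[17].string)
  pop = data.split("Popular Insurances")[1]
  # Slice out the first [[...]] array after the marker and parse it as JSON
  insurances = json.loads(pop[pop.find("[["):pop.find("]]")+2])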

The returned HTML contains no insurance plans. I also don't see any requests in the network tab where the plans are sent back (there are a few backbone files). One URL looks encoded, but I'm not sure whether it is relevant or whether I'm just overthinking it.

I've also tried using dryscrape to wait for all the JS to load so that the data is in the DOM, but there are still no plans in the HTML.

Is there a way to gather this information without having a crawler click on every insurance provider to get their plans?

user2954587
  • The url you posted is encoded and it is simply the ordinal number of the text, which you can get back by changing the outer `eval` within the information to a `console.log`. Which will return more functions. So the last line will be: `console.log(eval('String.fromCharCode('+z+')'));})()` instead of `eval(eval('String.fromCharCode('+z+')'));})()`. – Cory Shay Jul 25 '16 at 22:56
  • @CoryShay thanks for pointing that out. It looks like it's for cookies – user2954587 Jul 25 '16 at 23:01

1 Answer

Yes, the list of insurances is kept deep inside the script tag:

insuranceModel = new gs.CarrierGroupedSelect(gs.CarrierGroupedSelect.prototype.parse({
...
primary_options: {
        name: "Popular Insurances",
        group: "primary",
        options: [[300,"Aetna",2,0,1,0],[304,"Blue Cross Blue Shield",2,1,1,0],[307,"Cigna",2,0,1,0],[369,"Coventry Health Care",2,0,1,0],[358,"Medicaid",2,0,1,0],[322,"UniCare",2,0,1,0],[323,"UnitedHealthcare",2,0,1,0]]
    },
    secondary_options: {
        name: "All Insurances",
        group: "secondary",
        options: [[440,"1199SEIU",2,0,1,0],[876,"20/20 Eyecare Plan",2,0,1,1],...]
    }
...

You can, of course, dive into the wonderful world of parsing JavaScript code in Python, either with regular expressions or with JavaScript parsers like slimit (example here), but this might leave you with less hair on your head. Plus, the resulting solution would be quite fragile.
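To give an idea of what the regular-expression route could look like, here is a minimal sketch using only the standard library. It runs against a shortened copy of the snippet above rather than the live page, and it assumes the `name: "..."` / `options: [[...]]` layout shown there. The helper name `extract_insurance_groups` is made up for illustration, and only the first two fields of each row (which look like an id and a carrier name) are used; the meaning of the remaining numbers is unknown.

```python
import json
import re

# Shortened sample copied from the snippet above; on the live page this would
# be the text of the relevant <script> tag (an assumption about its layout).
SCRIPT_TEXT = '''
primary_options: {
    name: "Popular Insurances",
    group: "primary",
    options: [[300,"Aetna",2,0,1,0],[304,"Blue Cross Blue Shield",2,1,1,0],[307,"Cigna",2,0,1,0]]
},
secondary_options: {
    name: "All Insurances",
    group: "secondary",
    options: [[440,"1199SEIU",2,0,1,0],[876,"20/20 Eyecare Plan",2,0,1,1]]
}
'''

# Match `name: "<group name>"`, then lazily skip to the nearest `options: [[...]]`.
GROUP_RE = re.compile(r'name:\s*"([^"]+)".*?options:\s*(\[\[.*?\]\])', re.DOTALL)


def extract_insurance_groups(script_text):
    """Return {group name: [(id, carrier name), ...]} (hypothetical helper)."""
    groups = {}
    for name, raw_options in GROUP_RE.findall(script_text):
        # The option rows only contain numbers and double-quoted strings,
        # so they can be parsed as JSON.
        rows = json.loads(raw_options)
        groups[name] = [(row[0], row[1]) for row in rows]
    return groups


groups = extract_insurance_groups(SCRIPT_TEXT)
for group_name, carriers in groups.items():
    print(group_name, carriers)
```

This breaks as soon as the page changes the key names or nesting, or if a carrier name ever contains `]]`, which is the fragility mentioned above.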

In this particular case, I think selenium is a much better fit. Complete working example - getting the popular insurances:

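from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


# PhantomJS support was removed in Selenium 4; on newer versions use
# webdriver.Chrome() or webdriver.Firefox() instead.
driver = webdriver.PhantomJS()
driver.maximize_window()
driver.get("https://zocdoc.com")  # load the page before waiting on its elements

# Wait for the insurance dropdown link to become clickable, then open it
wait = WebDriverWait(driver, 10)
insurance_dropdown = wait.until(EC.element_to_be_clickable((By.LINK_TEXT, "I'll choose my insurance later")))
insurance_dropdown.click()

# Each option in the "primary" (popular) group carries its name in data-value
for option in driver.find_elements(By.CSS_SELECTOR, "[data-group=primary] + .ui-gs-option-set > .ui-gs-option"):
    print(option.get_attribute("data-value"))

driver.close()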

Prints:

Aetna
Blue Cross Blue Shield
Cigna
Coventry Health Care
Medicaid
UniCare
UnitedHealthcare

Note that in this case the headless PhantomJS browser is used, but you can use Chrome or Firefox or other browsers that selenium has an available driver for.

alecxe
  • yes, I can get the insurance providers, but not the plan name. Each of those providers also have plans. That's what I'm having trouble getting – user2954587 Jul 25 '16 at 23:42
  • an example plan is `ActiveCare 2` – user2954587 Jul 25 '16 at 23:43
  • @user2954587 gotcha, this would require clicking the insurances one by one and extracting the plans, I'll think of something and update you later. Thanks. – alecxe Jul 26 '16 at 00:31
  • @user2954587 here is what I've got at this point: https://gist.github.com/alecxe/e13676a368449a5987e8ea3f3cb91675. It is randomly failing and needs tweaking and debugging, hope you can work on that. The dropdown is not exactly greatly handling the quick selenium clicks and there is a need to add artificial delays. Plus, some insurances have no plans - so this is a special case as well. Thanks. – alecxe Jul 26 '16 at 06:16