
I've seen a couple of posts with this same question, but their scripts usually wait until one of the elements (a button) is clickable. Here is the table I'm trying to scrape:

https://ropercenter.cornell.edu/presidential-approval/highslows

On the first couple of tries, my code returned all the rows except both Polling Organization columns. Without my changing anything, it now only scrapes the table headers and the tbody tag (no table rows).

from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd

url = "https://ropercenter.cornell.edu/presidential-approval/highslows"
driver = webdriver.Firefox()
driver.get(url)

driver.implicitly_wait(12)
soup = BeautifulSoup(driver.page_source, 'lxml')
table = soup.find_all('table')
approvalData = pd.read_html(str(table[0]))
approvalData = pd.DataFrame(approvalData[0], columns = ['President', 'Highest %', 'Polling Organization & Dates H', 'Lowest %', 'Polling Organization & Dates L'])

Should I use an explicit wait? If so, which condition should I wait for, since the dynamic table is not interactive?

Also, why did the output of my code change after running it multiple times?

2 Answers


This may feel like cheating, but there is an easier way to solve your problem: look at what the frontend does (using the browser's developer tools) and you'll find that it calls an API that returns JSON. So Selenium isn't really needed; requests and pandas are enough.

import requests
import pandas as pd

url = "https://ropercenter.cornell.edu/presidential-approval/api/presidents/highlow"

data = requests.get(url).json()
df = pd.json_normalize(data)  # pd.io.json.json_normalize is the deprecated spelling
>>> df
                            president.id  president.active president.surname president.givenname president.shortname  ... low.approve  low.disapprove low.noOpinion low.sampleSize      low.presidentName
0   e9c0d19b-dfe9-49cf-9939-d06a0f256e57              True             Biden                 Joe                None  ...          33              53            13         1313.0              Joe Biden
1   bc9855d5-8e97-4448-b62e-1fb2865c79e6              True             Trump              Donald                None  ...          29              68             3         5360.0           Donald Trump
2   1c49881f-0f0c-4a53-9b2c-0dd6540f88e4              True             Obama              Barack                None  ...          37              57             5         1017.0           Barack Obama
3   ceda6415-5975-404d-8049-978758a7d1f8              True              Bush           George W.             W. Bush  ...          19              77             4         1100.0         George W. Bush
4   4f7344de-a7bd-4bc6-9147-87963ae51095              True           Clinton                Bill                None  ...          36              50            14          800.0           Bill Clinton
5   116721f1-f947-4c14-b0b5-d521ed5a4c8b              True              Bush         George H.W.           H.W. Bush  ...          29              60            11         1001.0       George H.W. Bush
6   43720f8f-0b9f-43b0-8c0d-63da059e7a57              True            Reagan              Ronald                None  ...          35              56             9         1555.0          Ronald Reagan
7   7aa76fd3-e1bc-4e9a-b13c-463a64e0c864              True            Carter               Jimmy                None  ...          28              59            13         1542.0           Jimmy Carter
8   6255dd77-531d-46c6-bb26-627e2a4b3654              True              Ford              Gerald                None  ...          37              39            24         1519.0            Gerald Ford
9   f1a23b06-4200-41e6-b137-dd46260ac4d8              True             Nixon             Richard                None  ...          23              55            22         1589.0          Richard Nixon
10  772aabfd-289b-4f10-aaae-81a82dd3dbc6              True           Johnson           Lyndon B.                None  ...          35              52            13         1526.0      Lyndon B. Johnson
11  d849b5a8-f711-4ac9-9728-c3915e17bb6a              True           Kennedy             John F.                None  ...          56              30            14         1550.0        John F. Kennedy
12  e22fd64a-cf20-4bc4-8db6-b4e71dc4483d              True        Eisenhower           Dwight D.                None  ...          48              36            16            NaN   Dwight D. Eisenhower
13  ab0bfa04-61da-49d1-8069-6992f6124f17              True            Truman            Harry S.                None  ...          22              65            13            NaN        Harry S. Truman
14  11edf04f-9d8d-4678-976d-b9339b46705d              True         Roosevelt         Franklin D.                None  ...          48              43             8            NaN  Franklin D. Roosevelt

[15 rows x 41 columns]
>>> df.columns
Index(['president.id', 'president.active', 'president.surname',
       'president.givenname', 'president.shortname', 'president.fullname',
       'president.number', 'president.terms', 'president.ratings',
       'president.termCount', 'president.ratingCount', 'high.id',
       'high.active', 'high.organization.id', 'high.organization.active',
       'high.organization.name', 'high.organization.ratingCount',
       'high.pollingStart', 'high.pollingEnd', 'high.updated',
       'high.president', 'high.approve', 'high.disapprove', 'high.noOpinion',
       'high.sampleSize', 'high.presidentName', 'low.id', 'low.active',
       'low.organization.id', 'low.organization.active',
       'low.organization.name', 'low.organization.ratingCount',
       'low.pollingStart', 'low.pollingEnd', 'low.updated', 'low.president',
       'low.approve', 'low.disapprove', 'low.noOpinion', 'low.sampleSize',
       'low.presidentName'],
      dtype='object')
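
If you only want something close to the table on the page, you can keep a handful of the 41 columns. Below is a minimal sketch, assuming the column names from the `df.columns` listing above; the short labels, the `COLUMNS` mapping, and the helper names `to_highs_lows` and `fetch_records` are my own choices for illustration, not anything the site defines.

```python
import pandas as pd

# Keys are column names taken from the df.columns listing above.
# Values are arbitrary display labels chosen for this sketch.
COLUMNS = {
    "president.fullname": "President",
    "high.approve": "Highest %",
    "high.organization.name": "High: Organization",
    "high.pollingStart": "High: Start",
    "high.pollingEnd": "High: End",
    "low.approve": "Lowest %",
    "low.organization.name": "Low: Organization",
    "low.pollingStart": "Low: Start",
    "low.pollingEnd": "Low: End",
}


def to_highs_lows(records):
    """Flatten the nested JSON records and keep only the table-like columns."""
    df = pd.json_normalize(records)
    return df[list(COLUMNS)].rename(columns=COLUMNS)


def fetch_records(url, timeout=30):
    """Download the JSON payload; fails loudly on HTTP errors."""
    import requests  # imported here so to_highs_lows can be used without requests

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.json()
```

Usage would be along the lines of `to_highs_lows(fetch_records(url))` with the `url` from the snippet above. Selecting with `df[list(COLUMNS)]` raises a `KeyError` if the API ever renames a field, which is usually preferable to silently getting empty columns.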
Dmytro O
  • Interesting, I agree that this is easier. How did you know where to place "api" in the URL? I'm not familiar with JSON. – The Corginator Apr 14 '22 at 19:23
  • @TheCorginator The similarity in the URLs is a coincidence. Since the frontend is dynamic and loads data on its own, I looked at what was going on in the browser's developer tools (specifically the Network tab), browsed through the responses, and found what I was looking for: this URL. After that it's just fetching and preprocessing. I've parsed similar sites before and wanted to use resources efficiently, which Selenium doesn't, so I applied this technique of finding the real data endpoint and fetching from it directly, bypassing the frontend and its dynamic content. – Dmytro O Apr 14 '22 at 20:57

Using only Selenium and GeckoDriver, to extract the table contents from the website you need to induce WebDriverWait for visibility_of_element_located(). To load the result into a pandas DataFrame, you can use the following locator strategy:

  • Code Block:

    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options
    from selenium.webdriver.firefox.service import Service
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
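    from io import StringIO
    import pandas as pd
    
    options = Options()
    # Kept from the original answer; Blink is Chromium's engine, so this flag probably has no effect in Firefox
    options.add_argument('--disable-blink-features=AutomationControlled')
    s = Service('C:\\BrowserDrivers\\geckodriver.exe')
    driver = webdriver.Firefox(service=s, options=options)
    driver.get('https://ropercenter.cornell.edu/presidential-approval/highslows')
    # Wait up to 20 seconds for the table to become visible, then grab its HTML
    tabledata = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//table[@class='table table-striped']"))).get_attribute("outerHTML")
    # Recent pandas versions deprecate passing a literal HTML string, so wrap it in StringIO
    tabledf = pd.read_html(StringIO(tabledata))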
    print(tabledf)
    driver.quit()
    
  • Console Output:

    [                President Highest %  ... Lowest %                     Polling Organization & Dates.1
    0               Joe Biden       63%  ...      33%  Quinnipiac UniversityJan 7th, 2022 - Jan 10th,...
    1            Donald Trump       49%  ...      29%                  PewJan 8th, 2021 - Jan 12th, 2021
    2            Barack Obama       76%  ...      37%  Gallup OrganizationSep 8th, 2011 - Sep 11th, 2011
    3          George W. Bush       92%  ...      19%  American Research GroupFeb 16th, 2008 - Feb 19...
    4            Bill Clinton       73%  ...      36%  Yankelovich Partners / TIME / CNNMay 26th, 199...
    5        George H.W. Bush       89%  ...      29%  Gallup OrganizationJul 31st, 1992 - Aug 2nd, 1992
    6           Ronald Reagan       68%  ...      35%  Gallup OrganizationJan 28th, 1983 - Jan 31st, ...
    7            Jimmy Carter       75%  ...      28%  Gallup OrganizationJun 29th, 1979 - Jul 2nd, 1979
    8             Gerald Ford       71%  ...      37%  Gallup OrganizationJan 10th, 1975 - Jan 13th, ...
    9           Richard Nixon       70%  ...      23%   Gallup OrganizationJan 4th, 1974 - Jan 7th, 1974
    10      Lyndon B. Johnson       80%  ...      35%  Gallup OrganizationAug 7th, 1968 - Aug 12th, 1968
    11        John F. Kennedy       83%  ...      56%  Gallup OrganizationSep 12th, 1963 - Sep 17th, ...
    12   Dwight D. Eisenhower       78%  ...      48%  Gallup OrganizationMar 27th, 1958 - Apr 1st, 1958
    13        Harry S. Truman       87%  ...      22%  Gallup OrganizationFeb 9th, 1952 - Feb 14th, 1952
    14  Franklin D. Roosevelt       84%  ...      48%  Gallup OrganizationAug 18th, 1939 - Aug 24th, ...
    
    [15 rows x 5 columns]]
    
undetected Selenium
  • That's weird. Using your code, my output is different for some reason... [Empty DataFrame Columns: [President, Highest %, Polling Organization & Dates, Lowest %, Polling Organization & Dates.1] Index: []] – The Corginator Apr 14 '22 at 21:43
  • What do you see as output? Can you update with a snapshot? – undetected Selenium Apr 14 '22 at 21:46
  • https://picbun.com/p/DPaC8kDS Maybe I installed geckodriver wrong? – The Corginator Apr 14 '22 at 21:51
  • Strange, we passed the entire `<table>` and it's reading only the headings. Are your Selenium/GeckoDriver/Firefox updated to the latest versions? Can you confirm you have kept the `WebDriverWait` part just like I suggested?
    – undetected Selenium Apr 14 '22 at 21:55
  • I believe they are up to date, I installed Selenium and GeckoDriver yesterday. I made sure to update Firefox as well. I basically put the geckodriver.exe into my python parent folder shown in this link https://picbun.com/p/3PqlvSEi – The Corginator Apr 14 '22 at 22:01
  • @TheCorginator I'm still not sure if the website is detecting GeckoDriver as a bot, however I have updated the answer with my actual test code. Let me know the status. – undetected Selenium Apr 14 '22 at 22:10
  • Your code worked perfectly the first time I ran it. However, I tried running it again and it has the same output as the previous screenshot :( – The Corginator Apr 14 '22 at 22:18
  • Detection :( Now don't blame me :) I wish I could have expressed more about detection – undetected Selenium Apr 14 '22 at 22:23
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/243918/discussion-between-the-corginator-and-undetected-selenium). – The Corginator Apr 14 '22 at 23:53