
I've written a script to collect names from this website after setting State/Province to Alabama and Country to United States in the search form. The script can parse the names from the first page, but I can't figure out how to get the results from the next pages as well using requests.

There are two options on the page to get all the names. Option one: use the show all 410 link; option two: use the next button.

This is what I've tried (it grabs the names from the first page):

import re
import requests
from bs4 import BeautifulSoup

URL = "https://cci-online.org/CCI/Verify/CCI/Credential_Verification.aspx"
params = {
    'errorpath': '/CCI/Verify/CCI/Credential_Verification.aspx'
}

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36'
    r = s.get(URL)

    # Two keys required in the query string are embedded in the page
    # source; scrape them out with regex.
    params['WebsiteKey'] = re.search(r"gWebsiteKey[^\']+\'(.*?)\'", r.text).group(1)
    params['hkey'] = re.search(r"gHKey[^\']+\'(.*?)\'", r.text).group(1)

    # Collect every named input (__VIEWSTATE and friends) as the form
    # payload, then set the two search dropdowns.
    soup = BeautifulSoup(r.text, "lxml")
    payload = {i['name']: i.get('value', '') for i in soup.select('input[name]')}
    payload['ctl01$TemplateBody$WebPartManager1$gwpciPeopleSearch$ciPeopleSearch$ResultsGrid$Sheet0$Input4$DropDown1'] = 'AL'
    payload['ctl01$TemplateBody$WebPartManager1$gwpciPeopleSearch$ciPeopleSearch$ResultsGrid$Sheet0$Input5$DropDown1'] = 'United States'

    r = s.post(URL, params=params, data=payload)
    soup = BeautifulSoup(r.text, "lxml")
    for item in soup.select("table.rgMasterTable > tbody > tr a[title]"):
        print(item.text)

In case someone suggests a Selenium-based solution: I've already had success with one (below), but I'm not willing to go that route:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

link = "https://cci-online.org/CCI/Verify/CCI/Credential_Verification.aspx"

with webdriver.Chrome() as driver:
    driver.get(link)
    wait = WebDriverWait(driver, 15)

    # Fill in the two search dropdowns, then submit the search.
    Select(wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "select[id$='Input4_DropDown1']")))).select_by_value("AL")
    Select(wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "select[id$='Input5_DropDown1']")))).select_by_value("United States")
    wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "input[id$='SubmitButton']"))).click()

    # Expand the full result set, then wait for the loading label to clear.
    wait.until(EC.visibility_of_element_located((By.XPATH, "//a[contains(.,'show all')]"))).click()
    wait.until(EC.invisibility_of_element_located((By.XPATH, "//span[@id='ctl01_LoadingLabel' and .='Loading']")))

    soup = BeautifulSoup(driver.page_source, "lxml")
    for item in soup.select("table.rgMasterTable > tbody > tr a[title]"):
        print(item.text)

How can I get the rest of the names from the following pages using the requests module?

MITHU
  • try this `ctl01$TemplateBody$WebPartManager1$gwpciPeopleSearch$ciPeopleSearch$ResultsGrid$Grid1$ctl00$ctl03$ctl01$GoToPageTextBox: 2` in the payload for the page number – Chandan Jan 11 '21 at 13:40
  • Are you aware that export buttons for various file formats appear below the search fields when the results are displayed? They download all search results in one go, so either your goal may be achieved without coding, or you could load the csv in pandas and process it as you wish. – RJ Adriaansen Jan 17 '21 at 22:19
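If the export route RJ mentions works for you, post-processing the download is trivial in pandas. A minimal sketch, assuming the export was saved locally as results.csv and has a Name column (both the filename and the column header are assumptions; check the actual file):

import pandas as pd

# "results.csv" and the "Name" column are hypothetical; substitute the
# actual exported filename and header.
df = pd.read_csv("results.csv")
print(df["Name"].tolist())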

1 Answer


First, click that link in Chrome with the Network panel open, then look at the Form Data for the request:

[screenshot of the request's Form Data in the Network panel]

Pay extra attention to __EVENTTARGET and __EVENTARGUMENT.

Next, inspect one of those pager links; they will look like this:

<a onclick="return false;" title="Go to page 2" class="rgCurrentPage" href="javascript:__doPostBack('ctl01$TemplateBody$WebPartManager1$gwpciPeopleSearch$ciPeopleSearch$ResultsGrid$Grid1$ctl00$ctl02$ctl00$ctl07','')"><span>2</span></a>

The __doPostBack arguments go in __EVENTTARGET and __EVENTARGUMENT, and everything else should match what you see in the Network panel (headers as well as form data).
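Roughly, the paging loop looks like the sketch below. It's untested: it assumes the initial search POST from the question has already run (so s, params, payload, and r are in scope) and that every pager link carries a "Go to page N" title like the anchor above:

import re
from bs4 import BeautifulSoup

page = 2
while True:
    soup = BeautifulSoup(r.text, "lxml")
    for item in soup.select("table.rgMasterTable > tbody > tr a[title]"):
        print(item.text)

    # Find the pager anchor for the next page and pull the __doPostBack
    # target out of its href.
    link = soup.select_one(f"a[title='Go to page {page}']")
    if link is None:
        break
    target = re.search(r"__doPostBack\('([^']+)'", link["href"]).group(1)

    # Rebuild the payload from this page's hidden inputs so __VIEWSTATE
    # and friends stay current, then set the postback fields.
    payload = {i["name"]: i.get("value", "") for i in soup.select("input[name]")}
    payload["__EVENTTARGET"] = target
    payload["__EVENTARGUMENT"] = ""
    # A submit button's name must not be posted alongside __EVENTTARGET,
    # or the server fires the button's event instead; drop any that the
    # comprehension picked up.
    for btn in soup.select("input[type=submit][name]"):
        payload.pop(btn["name"], None)

    r = s.post(URL, params=params, data=payload)
    page += 1

Note that the dictionary comprehension only captures input elements, so the two select values (state and country) may need to be re-added to the payload on every request, exactly as in the original search POST.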

It can be helpful to proxy your requests through Charles or Fiddler so you can compare the browser's request and yours side by side.

pguardiario