
I want to scrape the website https://www.unspsc.org/search-code/default.aspx?CSS=51%&Type=desc&SS%27= using Selenium, but I am only able to scrape the first page, not the other pages.

Here is my Selenium code:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

chromeOptions = webdriver.ChromeOptions()
chromeOptions.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome(executable_path='C:/Users/ptiwar34/Documents/chromedriver.exe', chrome_options=chromeOptions, desired_capabilities=chromeOptions.to_capabilities())
driver.get('https://www.unspsc.org/search-code/default.aspx?CSS=51%&Type=desc&SS%27=')
WebDriverWait(driver, 20).until(EC.staleness_of(driver.find_element_by_xpath("//td/a[text()='2']")))
driver.find_element_by_xpath("//td/a[text()='2']").click()

numLinks = len(WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//td/a[text()='2']"))))
print(numLinks)
for i in range(numLinks):
    print("Perform your scraping here on page {}".format(str(i+1)))
    WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//td/a[text()='2']/span//following::span[1]"))).click()
driver.quit()

Here is the relevant HTML content:

    <td><span>1</span></td>
    <td><a href="javascript:__doPostBack(&#39;dnn$ctr1535$UNSPSCSearch$gvDetailsSearchView&#39;,&#39;Page$2&#39;)" style="color:#333333;">2</a></td>

This throws an error:

raise TimeoutException(message, screen, stacktrace)
TimeoutException
Ayush Kangar

2 Answers


To scrape the website https://www.unspsc.org/search-code/default.aspx?CSS=51%&Type=desc&SS%27= using Selenium, you can use the following Locator Strategy:

  • Code Block:

      from selenium import webdriver
      from selenium.webdriver.support.ui import WebDriverWait
      from selenium.webdriver.common.by import By
      from selenium.webdriver.support import expected_conditions as EC
      from selenium.common.exceptions import TimeoutException
    
      chrome_options = webdriver.ChromeOptions() 
      chrome_options.add_argument("start-maximized")
      driver = webdriver.Chrome(options=chrome_options, executable_path=r'C:\WebDrivers\chromedriver.exe')
      driver.get("https://www.unspsc.org/search-code/default.aspx?CSS=51%&Type=desc&SS%27=%27")
      while True:
          try:
              WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//table[contains(@id, 'UNSPSCSearch_gvDetailsSearchView')]//tr[last()]//table//span//following::a[1]"))).click()
              print("Clicked for next page")
          except TimeoutException:
              print("No more pages")
              break
      driver.quit()
    
  • Console Output:

      Clicked for next page
      Clicked for next page
      Clicked for next page
      .
      .
      .
    
  • Explanation: If you observe the HTML DOM, the page numbers are within a <table> with a dynamic id attribute containing the text UNSPSCSearch_gvDetailsSearchView. Further, the page numbers are within the last <tr>, which has a child <table>. Within the child table the current page number is within a <span>, which holds the key. So to click() on the next page number you just need to identify the following <a> tag with index [1]. Finally, as the element has javascript:__doPostBack(), you have to induce WebDriverWait for the desired element_to_be_clickable().

You can find a detailed discussion in How do I wait for a JavaScript __doPostBack call through Selenium and WebDriver
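
As a rough, untested sketch of the same idea end to end (the chromedriver path and timeout values below are placeholders; only the table id fragment UNSPSCSearch_gvDetailsSearchView and the pagination XPath come from the answer above), you can hold a reference to the current grid, click the next-page link, and wait for the old grid to go stale before scraping the new page:

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

driver = webdriver.Chrome(executable_path=r'C:\WebDrivers\chromedriver.exe')
driver.get("https://www.unspsc.org/search-code/default.aspx?CSS=51%&Type=desc&SS%27=")

while True:
    # keep a handle on the current results grid so the postback can be detected
    old_table = driver.find_element_by_xpath("//table[contains(@id, 'UNSPSCSearch_gvDetailsSearchView')]")
    # scrape the rows of old_table here before moving to the next page
    try:
        next_link = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//table[contains(@id, 'UNSPSCSearch_gvDetailsSearchView')]//tr[last()]//table//span//following::a[1]")))
    except TimeoutException:
        break  # no next-page link, so this was the last page
    next_link.click()
    # __doPostBack() replaces the grid, so wait for the old one to detach
    WebDriverWait(driver, 20).until(EC.staleness_of(old_table))

driver.quit()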

undetected Selenium
  • could you help in scraping the code and title using Beautiful Soup for each page, as I am using this for a single page: unspsc_link = "https://www.unspsc.org/search-code/default.aspx?CSS=51%&Type=desc&SS%27=" link = requests.get(unspsc_link).text soup = BeautifulSoup(link, 'lxml') right_table = soup.find('table', id="dnn_ctr1535_UNSPSCSearch_gvDetailsSearchView") df = pd.read_html(str(right_table))[0] # Clean up the DataFrame df = df[[0, 1]] df.columns = df.iloc[0] df = df[1:] print(df) – Ayush Kangar Aug 22 '19 at 12:15
  • @AyushKangar Yup, that can be done. Can you raise a new question with your new requirement please? – undetected Selenium Aug 22 '19 at 12:18
  • could you explain "//table[contains(@id, 'UNSPSCSearch_gvDetailsSearchView')]//tr[last()]//table//span//following::a[1]" how to choose the path like table span and following – Ayush Kangar Aug 23 '19 at 08:43
  • @AyushKangar Added an explanation for the solution. let me know if any further queries. – undetected Selenium Aug 23 '19 at 09:24
  • after scraping several pages @DebanjanB it throws the error StaleElementReferenceException: stale element reference: element is not attached to the page document (Session info: chrome=76.0.3809.100) – Ayush Kangar Aug 23 '19 at 14:18
  • @AyushKangar Did you checkout the reference discussion I have added as a footnote within the answer? Does that help you? – undetected Selenium Aug 23 '19 at 14:20
  • hey @DebanjanB look into this https://stackoverflow.com/questions/57640584/scraping-a-website-using-selenium-for-each-product – Ayush Kangar Aug 24 '19 at 19:39
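
As a side note on the BeautifulSoup/pandas snippet in the first comment above: requests only ever fetches page 1, so one possible (untested) variation is to parse driver.page_source after each pagination click instead, assuming the grid keeps the id dnn_ctr1535_UNSPSCSearch_gvDetailsSearchView on every page:

import pandas as pd
from bs4 import BeautifulSoup

def scrape_current_page(driver):
    # parse whatever page Selenium currently has loaded
    soup = BeautifulSoup(driver.page_source, 'lxml')
    right_table = soup.find('table', id="dnn_ctr1535_UNSPSCSearch_gvDetailsSearchView")
    df = pd.read_html(str(right_table))[0]
    df = df[[0, 1]]          # keep only the first two columns (Code and Title)
    df.columns = df.iloc[0]  # the first row holds the header
    return df[1:]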

To find/click the page numbers you can use:

for x in driver.find_elements_by_xpath("//a[contains(@href,'UNSPSCSearch$gvDetailsSearchView')]"):
    if x.text.isdigit():
        print(x.text)
        #x.click()
        #...

Output:

2
3
4
...


Based on your comment, you can use:

max_pages = 10
for page_number in range(2, max_pages+1):
    for x in driver.find_elements_by_xpath("//a[contains(@href,'UNSPSCSearch$gvDetailsSearchView')]"):
        if x.text.isdigit():
            if int(x.text.strip()) == page_number:
                x.click()
                #parse results here
                break
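
If that loop still raises StaleElementReferenceException after a click (see the comments below), one hedged variation is to wait for the clicked link to go stale before re-querying anything, since the __doPostBack() rebuilds the grid:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

max_pages = 10
for page_number in range(2, max_pages + 1):
    for x in driver.find_elements_by_xpath("//a[contains(@href,'UNSPSCSearch$gvDetailsSearchView')]"):
        if x.text.isdigit() and int(x.text.strip()) == page_number:
            x.click()
            # the postback rebuilds the page, so wait for the clicked link to detach
            WebDriverWait(driver, 20).until(EC.staleness_of(x))
            #parse results here
            break
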
Pedro Lobito
  • when I run this, how would I scrape the pages? And when I use this code up to x.click() it throws an error after printing 2 and 3: StaleElementReferenceException: stale element reference: element is not attached to the page document (Session info: chrome=76.0.3809.100) – Ayush Kangar Aug 21 '19 at 19:05
  • You have to add a controller to know which page you are on and parse the new page numbers. – Pedro Lobito Aug 21 '19 at 19:20
  • I've posted an update. If my answer helped you, please consider accepting it as the correct answer, thank you! – Pedro Lobito Aug 21 '19 at 21:00