I'm trying to create a script that parses the part numbers from a webpage using requests. If you open this link and click the Product list tab, you will see the part numbers.

This image represents where the part numbers are.

I've tried with:

import requests

link = 'https://www.festo.com/cat/en-id_id/products_ADNH'
post_url = 'https://www.festo.com/cfp/camosHTML5Client/cH5C/HRQ'

payload = {"q":4,"ReqID":21,"focus":"f24~v472_0","scroll":[],"events":["e468~12~0~472~0~4","e468_0~6~472"],"ito":22,"kms":4}

with requests.Session() as s:
    s.headers['user-agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
    s.headers['referer'] = 'https://www.festo.com/cfp/camosHTML5Client/cH5C/go?q=2'
    s.headers['content-type'] = 'application/json; charset=UTF-8'
    r = s.post(post_url,data=payload)
    print(r.json())

When I execute the above script, I get the following result:

{'isRedirect': True, 'url': '../../camosStatic/Exception.html'}

How can I fetch the part numbers from that site using requests?

With Selenium, I tried the script below to fetch the part numbers, but it cannot click the Product list tab once I remove the hardcoded delay, and I don't want any hardcoded delay in the script.

import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
 
link = 'https://www.festo.com/cat/en-id_id/products_ADNH'
 
with webdriver.Chrome() as driver:
    driver.get(link)
    wait = WebDriverWait(driver,15)
    wait.until(EC.frame_to_be_available_and_switch_to_it(wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "object")))))
    wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "#btn-group-cookie > input[value='Accept all cookies']"))).click()
    driver.switch_to.default_content()
    wait.until(EC.frame_to_be_available_and_switch_to_it(wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "iframe#CamosIFId")))))
    
    time.sleep(10)   #I would like to get rid of this hardcoded delay
    
    item = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "[id='r17'] > [id='f24']")))
    driver.execute_script("arguments[0].click();",item)
    for elem in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "[data-ctcwgtname='tabTable'] [id^='v471_']")))[1:]:
        print(elem.text)
Chandan
MITHU
  • Unless you know exactly which payload you must provide, there is not much we can do to help. Their API seems very cumbersome, using single letters as parameters. The response you are getting seems to be due to an invalid request. I would suggest looking at selenium in such a case. – Nic Laforge Dec 22 '20 at 04:50
  • The keys and values within the payload that I've used I've taken from dev tools. – MITHU Dec 22 '20 at 04:56
  • You cannot do that! Each request will have different values; unless you know how to generate them, it is a lost cause using requests. It should be a simple task using selenium if you are familiar with it. – Nic Laforge Dec 22 '20 at 04:58
  • This is definitely not a lost cause using requests as you mentioned. There are always ways which I'm trying to figure out. FYI, I'm very familiar with selenium but I'm not willing to go that route. Thanks. – MITHU Dec 22 '20 at 05:05
  • Did not say impossible.. best of luck! Have you been able to understand the response data? – Nic Laforge Dec 22 '20 at 05:37
  • Usually prefer to simulate requests but in this instance, I admit that Selenium is the better idea. – xwhitelight Dec 22 '20 at 07:53
  • I can't answer the question, but I can give you my opinion. At first I thought it was a cookie issue, but I replayed the request several times, all resulting in a "{"RID":0}" response. Although the request parameters are simple, I noticed small changes between requests and I think the answer is a correct combination of request parameters (and probably cookies). If you insist on using requests, you could visit the "Initiator" tab of the request and examine how Js creates those parameters. The script is not obfuscated, and Chrome has great debugging tools, but Js-R.E. can take some time. – t.m.adam Dec 22 '20 at 17:56
  • To help you get started, in `SubmitPostData()` (the function that submits the request) we see that post data are stored in `gPOSTData`, which is created in `DoSubmit()` and contains `n.ReqID = ++gRequestID_Posted, r && (n.focus = r), n.scroll = SaveScrollEvents(), n.events = gArrEvents, n.externaldata = gExternalEventDataArray.arrData,n.ito = gEventController.GetIdleTime(), n.kms = EncodeMouseState());`. The names give me the impression that most of the parameters do some basic client fingerprinting, possibly to detect automation. – t.m.adam Dec 22 '20 at 18:45
  • Thanks for your great effort and suggestion t.m.adam. I spent a substantial amount of time to solve this using requests with no success. So, at this point it seems to me that I should stick with selenium to avoid overcomplexity as suggested by @Nic Laforge in the first place. However, I've attached my selenium approach above which also gets stuck when it comes to click on the product list tab. Thanks. – MITHU Dec 22 '20 at 20:32
  • You can use selenium. Notice that your table is FRAME. Just switch to the frame and find numbers using xpath or some other locator. – Gaj Julije Dec 24 '20 at 12:55
  • That is what I did in my above script @Gaj Julije. – MITHU Dec 24 '20 at 15:46
  • I'm not sure that you can bypass the "accept all cookies" button with python requests. When you physically click the button a "ckns_policy" cookie and 4 other cookies are set for the session. I have attempted to bypass this button by manually adding the "ckns_policy" cookie and the others, but so far nothing works. – Life is complex Dec 27 '20 at 15:38
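
One detail in the requests attempt may be worth ruling out, independent of the per-request parameters discussed in the comments: requests form-encodes a dict passed as data=, whereas json= serializes it as a JSON body matching the 'application/json' content-type header. The sketch below only illustrates the two body shapes with the standard library (urlencode is used here as an approximation of the form encoding); it is not a confirmed fix, and the server may still reject the request for the reasons given above.

```python
import json
from urllib.parse import urlencode

# Payload copied from the question.
payload = {"q": 4, "ReqID": 21, "focus": "f24~v472_0", "scroll": [],
           "events": ["e468~12~0~472~0~4", "e468_0~6~472"], "ito": 22, "kms": 4}

# Approximately what a form-encoded body looks like (the data=<dict> case):
# list values become repeated keys and the empty list disappears entirely.
form_body = urlencode(payload, doseq=True)

# What a JSON body looks like (the json=<dict> case).
json_body = json.dumps(payload)

print(form_body)
print(json_body)
```

If the endpoint expects JSON, s.post(post_url, json=payload) sends the second shape; whether that is sufficient here is untested.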

3 Answers

1

To grab part numbers from the webpage using Selenium you need to:

  • Induce WebDriverWait for the object frame to be available and switch to it.

  • Induce WebDriverWait for the desired element to be clickable and click on the Accept all cookies.

  • Switch back to the default_content()

  • Induce WebDriverWait for the desired frame to be available and switch to it.

  • Induce WebDriverWait for the staleness_of() of the stale element.

  • Click on the tab with text as Product list using execute_script().

  • You can use the following Locator Strategies:

    driver.get('https://www.festo.com/cat/en-id_id/products_ADNH')
    WebDriverWait(driver, 20).until(EC.frame_to_be_available_and_switch_to_it((By.TAG_NAME,"object")))
    WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "input.btn.btn-primary#accept-all-cookies"))).click()
    driver.switch_to.default_content()
    WebDriverWait(driver, 20).until(EC.frame_to_be_available_and_switch_to_it((By.CSS_SELECTOR,"iframe#CamosIFId")))
    WebDriverWait(driver, 20).until(EC.staleness_of(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//div[text()='Product list']")))))
    driver.execute_script("arguments[0].click();", WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//div[text()='Product list']"))))
    print([my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='ah']/img//following::div[2]")))])
    driver.quit()
    
  • Note : You have to add the following imports :

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    
  • Console Output:

    ['539691', '539692', '539693', '539694']
    

undetected Selenium
  • I tried your code but it didn't work. It gets stuck when it is supposed to click on the product list tab and eventually throws timeout exception pointing at this `print()` line. The code that I've pasted above is a working one but I had to use hardcoded delay to make it work. – MITHU Dec 26 '20 at 06:02
1

The difficulty is getting the driver to click the 'Product list' button, so I came up with this solution:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as ec
from selenium.common.exceptions import TimeoutException, StaleElementReferenceException
from selenium import webdriver
import time

class NoPartsNumberException(Exception):
    pass

driver = webdriver.Chrome()
wait = WebDriverWait(driver, 10)


driver.get("https://www.festo.com/cat/en-id_id/products_ADNH")
wait.until(ec.frame_to_be_available_and_switch_to_it(wait.until(ec.visibility_of_element_located((By.CSS_SELECTOR, "object")))))
wait.until(ec.presence_of_element_located((By.CSS_SELECTOR, "#btn-group-cookie > input[value='Accept all cookies']"))).click()
driver.switch_to.default_content()
wait.until(ec.frame_to_be_available_and_switch_to_it((By.XPATH, "//iframe[@name='CamosIF']")))

endtime = time.time() + 30
while True:
    try:
        if time.time() > endtime:
            raise NoPartsNumberException('No parts number found')
        product_list = wait.until(ec.element_to_be_clickable((By.XPATH, "//div[@id='f24']")))
        product_list.click()
        part_numbers_elements = wait.until(ec.visibility_of_all_elements_located((By.XPATH, "//div[contains(@id, 'v471')]")))
        break
    except (TimeoutException, StaleElementReferenceException):
        pass

part_numbers = [p.text for p in part_numbers_elements[1:]]
print(part_numbers)

driver.close()

This way the driver keeps clicking the 'Product list' button until the table containing the part numbers opens, so you wait much less than the 10 seconds of the hardcoded time.sleep in your code.
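
The same retry idea can be written with the deadline as the loop condition, so the loop cannot run forever. This is a minimal standard-library sketch, not Selenium code: retry_until, flaky_click and the RuntimeError used to simulate a failed click are hypothetical names chosen for illustration.

```python
import time

def retry_until(action, timeout=30.0, interval=0.5, ignored=(Exception,)):
    """Call action() until it returns without raising, or the deadline passes."""
    deadline = time.monotonic() + timeout
    last_error = None
    while time.monotonic() < deadline:      # the deadline is the loop condition
        try:
            return action()
        except ignored as exc:
            last_error = exc                # remember why the last attempt failed
            time.sleep(interval)
    raise TimeoutError(f"gave up after {timeout} seconds") from last_error

# Demo: an action that fails twice (as a click on a re-rendering page might)
# and succeeds on the third attempt.
attempts = {"count": 0}

def flaky_click():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("element not ready")
    return "clicked"

result = retry_until(flaky_click, timeout=5, interval=0.01, ignored=(RuntimeError,))
print(result, attempts["count"])
```

In the Selenium script, the action would be the click plus the lookup of the part-number elements, and ignored would be (TimeoutException, StaleElementReferenceException).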

marco
  • If you don't reach the break statement, this loop will run forever. You are raising/re-raising `TimeoutException` and ignoring it in your except clause. You should use the time-expired check as the condition to enter the loop. You also need to handle the case where the break statement is never reached. The OP is already using `until()`, so I don't see the point of creating such a while loop. – Nic Laforge Dec 27 '20 at 02:50
  • The break is reachable when the code reaches the timeout, while it ignores the TimeoutException raised when it doesn't find the part number elements. If it doesn't find part numbers, it breaks the while loop and raises the TimeoutException. Basically, what I wrote is a custom function to wait until the element is really clickable. Have you tried my code? I tested it and it works very well. – marco Dec 27 '20 at 11:08
  • I have to downvote this. Just saying it is working does not make the above right. You have surely not tested the time-expired case. Again, it will create an infinite loop. I agree you created another wait.until() with your code, but why? You can use the `until` from selenium and create your own class (see my answer). That removes the need to handle a timeout and exceptions, and avoids duplicated, error-prone code. – Nic Laforge Dec 27 '20 at 22:19
  • Sorry, you are right. I wrote my own TimeoutException, but it was a bad idea. I'll edit the answer to use a different kind of exception. – marco Dec 27 '20 at 23:06
1

I believe you have covered the iframe and WebDriverWait concept well.

The site seems to re-render the content a few times before the right element can actually be retrieved and clicked. That is why you had to add a sleep of 10 seconds.

There is a belief that EC must be used when using WebDriverWait. EC is only a collection of helper classes that retrieve an element with some defined property (i.e. visible, hidden, clickable...).

In your case, ec.visibility_of_all_elements_located was a good choice. But once the element is retrieved, the DOM is re-rendered, and you will get a StaleElementReferenceException if you use the WebElement click method. I also believe that the click using JS will just be ignored, as the passed element is no longer present.

Since until() can be used to determine when to return an element, why not use it and create our own EC class:

class SelectProductTab(object):
    def __init__(self, locator):
        self.locator = locator
        self._selected_background_image = 'url("IMG?i=ec2a883936d53541a030c2ddb511e7e8&s=p")'

    def __call__(self, driver):
        els = driver.find_elements(*self.locator)
        if len(els) > 0:
            els[0].click()
        else:
            return False
        return els[0] if self.__is_selected(els[0]) else False

    def __is_selected(self, el):
        return self._selected_background_image in el.get_attribute('style')

This class will do the following:

  1. Retrieve the element
  2. Click on it
  3. Ensure the desired tab is selected. Basically ensure the click did work
  4. Upon the tab being selected, returns the element back to the caller

One part is not handled by the class, because WebDriverWait already supports it: exception handling. In your case, you will be facing StaleElementReferenceException.

wait = WebDriverWait(driver, 30, ignored_exceptions=(StaleElementReferenceException, ))

Then call until() with your own implementation of an EC class:

wait.until(SelectProductTab((By.CSS_SELECTOR, "[id='r17'] > [id='f24']")))
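
To see why this works, it helps to picture what until() does with the object it receives. The following is a simplified stand-in written with the standard library only, not Selenium's actual source: the until function, FakeStaleError and SelectTabDemo are hypothetical names, and the driver argument is just passed through.

```python
import time

class FakeStaleError(Exception):
    """Hypothetical stand-in for StaleElementReferenceException."""

def until(condition, driver, timeout=5.0, poll=0.01, ignored_exceptions=()):
    """Poll condition(driver) until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            value = condition(driver)
            if value:
                return value            # truthy result ends the wait
        except ignored_exceptions:
            pass                        # e.g. element went stale: try again
        if time.monotonic() > deadline:
            raise TimeoutError("condition never became truthy")
        time.sleep(poll)

class SelectTabDemo:
    """Mimics SelectProductTab: succeeds only once the tab is really selected."""
    def __init__(self):
        self.calls = 0

    def __call__(self, driver):
        self.calls += 1
        if self.calls == 1:
            raise FakeStaleError("DOM re-rendered")  # first try: stale element
        if self.calls == 2:
            return False                             # second try: click had no effect
        return "tab-element"                         # third try: tab is selected

condition = SelectTabDemo()
element = until(condition, driver=None, ignored_exceptions=(FakeStaleError,))
print(element, condition.calls)
```

Because the click happens inside the polled callable, every retry clicks again, which is what replaces the fixed sleep.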

Full code:

with webdriver.Chrome(ChromeDriverManager().install(), options=options) as driver:
    driver.get(link)
    wait = WebDriverWait(driver, 15)
    wait.until(EC.frame_to_be_available_and_switch_to_it(
        wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "object")))))
    wait.until(EC.presence_of_element_located(
        (By.CSS_SELECTOR, "#btn-group-cookie > input[value='Accept all cookies']"))).click()
    driver.switch_to.default_content()
    wait.until(EC.frame_to_be_available_and_switch_to_it(
        wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "iframe#CamosIFId")))))
    
    # Sleep was removed, click is now handled inside our own EC class + will ensure the tab is selected
    wait = WebDriverWait(driver, 30, ignored_exceptions=(StaleElementReferenceException, ))
    
    wait.until(SelectProductTab((By.CSS_SELECTOR, "[id='r17'] > [id='f24']")))
    
    # The loop must stay inside the with block so the driver is still open
    for elem in wait.until(
            EC.presence_of_all_elements_located(
                (By.CSS_SELECTOR, "[data-ctcwgtname='tabTable'] [id^='v471_']")))[1:]:
        print(elem.text)

Output:

539691
539692
539693
539694

Remember to add the following import:

from selenium.common.exceptions import StaleElementReferenceException
Nic Laforge