This might be the stupidest question I've asked yet, but this is driving me nuts...

Basically, I want to get all links from profiles, but for some reason Selenium returns a different number of links most of the time (sometimes all of them, sometimes only a tenth).

I experimented with time.sleep, and I know it affects the output somehow, but I don't understand where the problem is. (That's just my hypothesis, though; it may be wrong.)

I have no other explanation for the inconsistent output. Since I do get all profile links from time to time, the program is able to find all relevant profiles.

Here's what the output should be (for different GUI inputs):

input: anlagenbau, output: 3070

input: Fahrzeugbau, output: 4065

input: laserschneiden, output: 1311



from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
from selenium.common.exceptions import TimeoutException
from urllib.request import urlopen
from datetime import date
from datetime import datetime
import easygui
import re
from selenium.common.exceptions import NoSuchElementException
import time

# input window for the search term (suchbegriff)
suchbegriff = easygui.enterbox("Enter search term | Note: the search term should not contain '/'")

#get date and time
now = datetime.now()
current_time = now.strftime("%H-%M-%S")
today = date.today()
date = today.strftime("%Y-%m-%d")

def get_profile_url(label_element):
    # get the url from a result element
    onclick = label_element.get_attribute("onclick")
    # extract the URL between open(' and the closing quote
    return re.search(r"(?<=open\(\')(.*?)(?=\')", onclick).group()


def load_more_results():
    # load more results if needed // use only on the search page!
    button_wrapper = wd.find_element(By.CLASS_NAME, "loadNextBtn")
    button_wrapper.find_element(By.TAG_NAME, "span").click()


#### Script starts here ####

# Set some Selenium Options
options = webdriver.ChromeOptions()
options.add_argument("--no-sandbox")
options.add_argument("--disable-dev-shm-usage")

# Webdriver
wd = webdriver.Chrome(options=options)
# Load URL
wd.get("https://www.techpilot.de/zulieferer-suchen?"+str(suchbegriff))


# let's first wait for the iframe
iframe = WebDriverWait(wd, 5).until(
    EC.frame_to_be_available_and_switch_to_it("efficientSearchIframe")
)

# the result parent
result_pane = WebDriverWait(wd, 5).until(
    EC.presence_of_element_located((By.ID, "resultPane"))
)

#get all profilelinks as list
time.sleep(5)
href_list = []
wait = WebDriverWait(wd, 15)

while True:
    try:
        #time.sleep(1)
        wd.execute_script("loadFollowing();")
        #time.sleep(1)
        try:
            wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".fancyCompLabel")))
        except TimeoutException:
            break
        #time.sleep(1) # somehow influences how many results are found
        result_elements = wd.find_elements(By.CLASS_NAME, "fancyCompLabel")
        #time.sleep(1)
        for element in result_elements:
            url = get_profile_url(element)
            href_list.append(url)
        #time.sleep(2)
        while True:
            try:
                element = wd.find_element(By.CLASS_NAME, 'fancyNewProfile')
                wd.execute_script("""var element = arguments[0];element.parentNode.removeChild(element);""", element)
            except NoSuchElementException:
                break
            
    except NoSuchElementException:
        break

wd.close()  # close the browser window
print("####links secured: "+str(len(href_list)))

Aquitter

2 Answers

Since you say that the sleep is affecting the number of results, it sounds like they're loading asynchronously and populating as they're loaded, instead of all at once.

The first question is whether you can ask the website's developers to change this, so that the results only appear once they have all loaded.

Assuming you don't work for the same company as them, consider:

  • Is there something else on the page that shows up when they're all loaded? It could be a button or a status message, for instance. Can you wait for that item to appear, and then get the list?
  • How frequently do new items appear? You could poll for the number of results relatively infrequently, such as only every 2 or 3 seconds, and then consider the results all present when you get the same number of results twice in a row.
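
The polling idea from the second bullet can be sketched roughly as follows. This is a minimal sketch, not part of Selenium: wait_for_stable_count and its parameters are hypothetical names, and get_count stands for whatever callable returns the current number of results.

```python
import time


def wait_for_stable_count(get_count, interval=2.0, timeout=60.0):
    """Poll get_count() until it returns the same value twice in a row.

    Hypothetical helper: returns the stable count, or raises TimeoutError
    if the count is still changing when the timeout expires.
    """
    deadline = time.monotonic() + timeout
    previous = get_count()
    while time.monotonic() < deadline:
        time.sleep(interval)
        current = get_count()
        if current == previous:
            # Same number of results on two consecutive polls
            return current
        previous = current
    raise TimeoutError(f"count still changing after {timeout} s (last value: {previous})")


# Possible usage with the asker's driver `wd` (assumes selenium is installed):
# from selenium.webdriver.common.by import By
# total = wait_for_stable_count(
#     lambda: len(wd.find_elements(By.CLASS_NAME, "fancyCompLabel"))
# )
```

Note that two consecutive polls can both see 0 if nothing has loaded yet, so it is probably worth waiting for the first result to appear before starting to poll.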
Ryan Lundy

The issue is that presence_of_all_elements_located doesn't wait for all elements matching the passed locator. It waits for the presence of at least one element matching the locator and then returns a list of the elements found on the page at that moment.
In Java we have

wait.until(ExpectedConditions.numberOfElementsToBeMoreThan(element, expectedElementsAmount));

and

wait.until(ExpectedConditions.numberOfElementsToBe(element, expectedElementsAmount));

With these methods you can wait for a predefined number of elements to appear.
Selenium's Python bindings don't provide these methods out of the box.
The only thing you can do with Selenium in Python is build a custom method for this.
So if you expect a certain number of elements (links etc.) to be present on the page, you can use such a method.
This makes your test stable and avoids hardcoded sleeps.
Update:
I have found this solution.
It appears to be a Python equivalent of wait.until(ExpectedConditions.numberOfElementsToBeMoreThan(element, expectedElementsAmount));

myLength = 9
WebDriverWait(browser, 20).until(lambda browser: len(browser.find_elements(By.XPATH, "//img[@data-blabla]")) > int(myLength))

And this

myLength = 10
WebDriverWait(browser, 20).until(lambda browser: len(browser.find_elements(By.XPATH, "//img[@data-blabla]")) == int(myLength))

is the equivalent of the Java call wait.until(ExpectedConditions.numberOfElementsToBe(element, expectedElementsAmount));
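
The same idea can also be packaged as a reusable wait condition instead of an inline lambda. The following is only a sketch: number_of_elements_more_than is a hypothetical helper name, not something Selenium ships, and it relies on WebDriverWait.until polling a callable until it returns a truthy value.

```python
class number_of_elements_more_than:
    """Custom wait condition: truthy once more than `count` elements match.

    Hypothetical helper, not part of Selenium's expected_conditions module.
    """

    def __init__(self, locator, count):
        self.locator = locator  # e.g. (By.CSS_SELECTOR, ".fancyCompLabel")
        self.count = count

    def __call__(self, driver):
        elements = driver.find_elements(*self.locator)
        # WebDriverWait.until keeps polling while the return value is falsy
        return elements if len(elements) > self.count else False


# Possible usage with a real driver (assumes selenium is installed and
# `wd` is the asker's driver):
# from selenium.webdriver.common.by import By
# from selenium.webdriver.support.wait import WebDriverWait
# labels = WebDriverWait(wd, 20).until(
#     number_of_elements_more_than((By.CSS_SELECTOR, ".fancyCompLabel"), 9)
# )
```

Returning the element list rather than True means the result of until() can be used directly, the same way the built-in conditions return the elements they found.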

Prophet