
The problem is probably memory usage. The page gets really slow, and at some point the following error message appears:

[Screenshot of the browser error message]

from bs4 import BeautifulSoup
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
from selenium.common.exceptions import TimeoutException, ElementClickInterceptedException
from selenium.webdriver.common.keys import Keys
from selenium.webdriver import ActionChains


# Set some Selenium Options
options = webdriver.ChromeOptions()
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')

# Webdriver
wd = webdriver.Chrome(executable_path='/usr/bin/chromedriver', options=options)
# URL
url = 'https://www.techpilot.de/zulieferer-suchen?laserschneiden'

# Load URL
wd.get(url)

# Get HTML
soup = BeautifulSoup(wd.page_source, 'html.parser')
wd.fullscreen_window()


wait = WebDriverWait(wd, 15)
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "#bodyJSP #CybotCookiebotDialogBodyLevelButtonLevelOptinAllowAll"))).click()
wait.until(EC.frame_to_be_available_and_switch_to_it((By.CSS_SELECTOR, "#efficientSearchIframe")))
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, ".hideFunctionalScrollbar #CybotCookiebotDialogBodyLevelButtonLevelOptinAllowAll"))).click()
#wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".fancyCompLabel")))
roaster = wd.find_element_by_xpath('//*[@id="resultTypeRaster"]')
ActionChains(wd).click(roaster).perform()

#use keys to get where the button is
html = wd.find_element_by_tag_name('html')

c = 2
for i in range(100):
    html.send_keys(Keys.END)
    time.sleep(1)
    html.send_keys(Keys.END)
    time.sleep(1)
    html.send_keys(Keys.ARROW_UP)
    try:
        wait.until(EC.presence_of_all_elements_located((By.XPATH, "//*[@id='resultPane']/div[" + str(c) + "]/span")))
        loadButton = wd.find_element_by_xpath("//*[@id='resultPane']/div[" + str(c) + "]/span")
        loadButton.click()
    except (TimeoutException, ElementClickInterceptedException):  # a bare `A or B` here would only ever catch A
        break
    time.sleep(1)
    c += 1
wd.close()

Here are some links to similar problems that I looked through. I tried adding the options they suggest, but it won't work. Some of the other tips really confuse me, so I hope someone can help me here (I'm quite new to coding).

These are the links I looked through:

selenium.WebDriverException: unknown error: session deleted because of page crash from tab crashed

python linux selenium: chrome not reachable

unknown error: session deleted because of page crash from unknown error: cannot determine loading status from tab crashed with ChromeDriver Selenium

Just to clarify: the goal of the program is to get a list of all the profiles and scrape data from them. That's why this part of the program first loads the whole page, to collect all those links (as far as I know I can't just get them with BeautifulSoup, because of the JavaScript), so I don't have any workaround. Thanks a lot!

Aquitter
  • Are you trying to open all of the listed profiles, or just what is visible on the cards? – shiny Jul 29 '21 at 17:41
  • I'm trying to open all profiles on the website (that's why I'm instructing Selenium to click the "load more" button and scroll down, since the links are not found until they are loaded). Hope that answers your question – Aquitter Jul 29 '21 at 17:51
  • It does. Your case is a bit advanced for a beginner. I will alter your code a bit and post it shortly. – shiny Jul 29 '21 at 17:53
  • Thanks a lot for the help :) I would appreciate it if you could explain what you changed, so that I know what my mistake was and can learn from it. (Just to clarify: the output I'm trying to get is a list of links to each profile.) – Aquitter Jul 29 '21 at 18:03
  • Sounds like they keep loading new content without removing the old... which, after the DOM gets loaded up enough, will crash the browser (probably due to a JS framework getting overloaded). – pcalkins Jul 29 '21 at 18:25
  • No, the problem was due to an iframe that holds the desired information – shiny Jul 29 '21 at 18:55
  • @pcalkins Is there a way to remove the old content so that the browser won't crash? Or is it just something I can't prevent? – Aquitter Jul 30 '21 at 09:28
  • This might just be a bug on their end. If you can find a paginated view, that'd be easier to work with. – pcalkins Jul 30 '21 at 16:48
  • What's that? And how would it help? – Aquitter Jul 30 '21 at 18:11
  • By "paginated" I mean the page shows a defined number of results and has links for page 1, page 2, etc., until you get to the end of the data. It's like a book instead of a scroll. – pcalkins Jul 30 '21 at 20:27

2 Answers


Like I've mentioned in the comments, this is not an easy task for a beginner. This code should give you a start, though.

The biggest problem here is that the results are loaded in via an iframe, so you need to switch into it first.

Take a look at this code: it gets the basic info of each profile and returns it as a list of dicts. If you need some more explanation on this, feel free to ask in the comments.

import re
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait


def get_profile_info(profile_url):
    # gets info of a profile page // Adjust here to get more info
    wd.get(profile_url)
    label_element = WebDriverWait(wd, 5).until(
        EC.presence_of_element_located((By.ID, "labelAddress"))
    )
    label = label_element.find_element_by_tag_name("h1").text

    street = label_element.find_element_by_css_selector(
        "span[itemprop='streetAddress']"
    ).text

    postal_code = label_element.find_element_by_css_selector(
        "span[itemprop='postalCode']"
    ).text

    city = label_element.find_element_by_css_selector(
        "span[itemprop='addressLocality']"
    ).text

    address_region = label_element.find_element_by_css_selector(
        "span[itemprop='addressRegion']"
    ).text

    country = label_element.find_element_by_css_selector(
        "span[itemprop='addressCountry']"
    ).text

    return {
        "label": label,
        "street": street,
        "postal_code": postal_code,
        "city": city,
        "address_region": address_region,
        "country": country,
    }


def get_profile_url(label_element):
    # get the url from a result element
    onclick = label_element.get_attribute("onclick")
    # pull the URL out of the inline window.open('...') handler
    return re.search(r"(?<=open\(\')(.*?)(?=\')", onclick).group()


def load_more_results():
    # load more results if needed // use only on the search page!
    button_wrapper = wd.find_element_by_class_name("loadNextBtn")
    button_wrapper.find_element_by_tag_name("span").click()


#### Script starts here ####

# Set some Selenium Options
options = webdriver.ChromeOptions()
options.add_argument("--no-sandbox")
options.add_argument("--disable-dev-shm-usage")

# Webdriver
wd = webdriver.Chrome(options=options)
# Load URL
wd.get("https://www.techpilot.de/zulieferer-suchen?laserschneiden")


# let's first wait for the iframe that holds the results and switch into it
WebDriverWait(wd, 5).until(
    EC.frame_to_be_available_and_switch_to_it("efficientSearchIframe")
)

# the result parent
result_pane = WebDriverWait(wd, 5).until(
    EC.presence_of_element_located((By.ID, "resultPane"))
)


result_elements = wd.find_elements_by_class_name("fancyCompLabel")

# lets first collect all the links visible
href_list = []
for element in result_elements:
    url = get_profile_url(element)
    href_list.append(url)

# lets collect all the data now
result = []
for href in href_list:
    result.append(get_profile_info(href))

wd.close()

# lets see what we've got
print(result)
shiny
  • Hey, I looked through your code now. `def get_profile_info(profile_url)` is not really needed, since the rest of my code works just fine. I'm able to get the profile links with your code, but I'm not able to run load_more_results because it's unable to locate the "load next" button. – Aquitter Jul 30 '21 at 09:24
  • Have you used the load_more_results function after waiting for the iframe? – shiny Jul 30 '21 at 17:11
  • Yes, I paid attention to that – Aquitter Jul 30 '21 at 18:10
  • OK, what you can also do is simply execute the JavaScript that loads the new content: `wd.execute_script("loadFollowing();")` (see the sketch after this thread) – shiny Jul 30 '21 at 19:05
  • Hey, here's my current progress: executing the loadFollowing script works far more smoothly than the way I did it. The problem is that I still get a page crash when too many profiles load (it seems the DOM tree gets too large). I'm trying to remove the elements from the DOM tree that have already been added to the list :) – Aquitter Jul 31 '21 at 12:28
  • Sounds great. Do you get the same error message as in your first post? – shiny Aug 01 '21 at 05:08
  • Not anymore, removing items from the DOM was the key – Aquitter Aug 02 '21 at 09:52
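A minimal sketch of the approach discussed in this thread, assuming `loadFollowing()` is the page's own "load more" JavaScript function (as named by shiny above) and that result cards use the fancyCompLabel class from the answer's code; treat it as a sketch rather than a verified drop-in:

import time

def load_all_results(wd, max_rounds=100, pause=1.0):
    # Repeatedly call the page's own "load more" JavaScript until no
    # new result cards appear. Assumes the driver has already switched
    # into the results iframe (see the answer above).
    seen = 0
    for _ in range(max_rounds):
        wd.execute_script("loadFollowing();")  # page function named in the thread; assumed to exist
        time.sleep(pause)  # crude wait; an explicit wait on the card count would be nicer
        cards = wd.find_elements_by_class_name("fancyCompLabel")
        if len(cards) == seen:  # nothing new loaded -> stop
            break
        seen = len(cards)
    return seen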

The solution was to remove elements from the DOM tree, like @pcalkins said above; the DOM tree seems to "overload" otherwise. A sketch of the idea is below.
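For reference, here is a minimal sketch of that pruning idea, under the assumptions of shiny's answer above (the driver has switched into the results iframe, result cards have the class fancyCompLabel, and get_profile_url extracts the link from a card); a sketch, not tested code:

# Harvest the link from each card that is currently loaded, then delete
# the node so the DOM tree doesn't grow until the tab crashes.
href_list = []
for card in wd.find_elements_by_class_name("fancyCompLabel"):
    href_list.append(get_profile_url(card))  # collect the link first
    wd.execute_script("arguments[0].remove();", card)  # then drop the node from the DOM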

Aquitter