I am using Selenium WebDriver with Python to extract all startup URLs from this webpage (https://startup-map.berlin/companies.startups/f/all_locations/allof_Berlin/data_type/allof_Verified). I tried all the relevant methods mentioned in the post "How can I scroll a web page using selenium webdriver in python?" as well as other suggestions found online.

However, none of them worked for this website: only the first 25 startups are loaded. My code:

import time
from bs4 import BeautifulSoup
from datetime import datetime
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

webdriver = webdriver.Chrome(executable_path='chromedriver')

# Write into csv file
# NOTE: BLD is not defined in this snippet (presumably a pathlib.Path set elsewhere in the project)
f = open(BLD / "processed/startups_urls.csv", "w")
headers = "startups_urls\n"
f.write(headers)

url = "https://startup-map.berlin/companies.startups/f/all_locations/allof_Berlin/data_type/allof_Verified"

webdriver.get(url)
time.sleep(3)

# Get scroll height
last_height = webdriver.execute_script("return document.body.scrollHeight")

while True:
    # Scroll down to bottom
    webdriver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    # Wait to load page
    time.sleep(3)

    # Calculate new scroll height and compare with last scroll height
    new_height = webdriver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

htmlSource = webdriver.page_source
page_soup = BeautifulSoup(htmlSource, "html.parser")
startups = page_soup.find_all("div", {"class": "type-element type-element--h3 hbox entity-name__name entity-name__name--black"})
if startups:
    for startup in startups:
        startups_href = startup.a["href"]
        startups_url = "https://startup-map.berlin" + startups_href
        f.write(startups_url + "\n")
else:
    print("NaN.")

f.close()
webdriver.close()

Any suggestions? Thank you very much.

betahat

2 Answers

You can track the scrolling progress through the position of the vertical-thumb element of the page's scrollbar.
Read the translateY value from its style attribute and compare it with the previous value, much as you currently compare new_height with last_height.
That element can be located with this CSS selector: #window-scrollbar .vertical-thumb
So you can do the following:

element = webdriver.find_element_by_css_selector("#window-scrollbar .vertical-thumb")
attributeValue = element.get_attribute("style")

Now the attributeValue string contains something like this:

position: relative; display: block; width: 100%; background-color: rgba(34, 34, 34, 0.6); border-radius: 4px; z-index: 1500; height: 30px; transform: translateY(847px);

Now you can find the substring containing translateY and extract the number from it as follows:

index = attributeValue.find('translateY(')
sub_string = attributeValue[index:]
# In Python 3, filter() returns an iterator, so join the digits before converting
new_y_value = int(''.join(filter(str.isdigit, sub_string)))

Alternatively, extract the number with a regular expression:

import re

# re.findall returns a list of digit strings; the first one is the translateY offset
new_y_value = int(re.findall(r'\d+', sub_string)[0])
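
The extraction steps above can also be wrapped into one small helper. This is only a sketch: the name parse_translate_y is hypothetical (it is not part of Selenium), and it assumes the offset is written in px, as in the sample style string. It returns None when the style has no translateY, so a missing value does not raise during conversion.

```python
import re
from typing import Optional

# Hypothetical helper (not a Selenium API): matches e.g. "translateY(847px)"
_TRANSLATE_Y = re.compile(r"translateY\((-?\d+(?:\.\d+)?)px\)")


def parse_translate_y(style: str) -> Optional[float]:
    """Return the translateY offset (in px) from a CSS style string, or None."""
    match = _TRANSLATE_Y.search(style)
    return float(match.group(1)) if match else None


style = "height: 30px; transform: translateY(847px);"
print(parse_translate_y(style))  # 847.0
```

The style string would come from element.get_attribute("style") as shown earlier in this answer.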
Prophet
  • Thanks a lot. But I got this error message: "int() argument must be a string, a bytes-like object or a number, not 'filter'" Could you please clarify a bit more? – betahat May 12 '21 at 22:22
  • OK, we need to debug this. Unfortunately I have no Python installed on my machine, so I will ask you several questions. When debugging, does `attributeValue` contain a string with values like those in the answer? Does `index` hold an int? What does `sub_string` contain? – Prophet May 13 '21 at 09:32
  • Yes, `attributeValue` shows `'position: relative; display: block; width: 100%; background-color: rgba(34, 34, 34, 0.6); border-radius: 4px; z-index: 1500; height: 499px; transform: translateY(0px);'`. I used `index = attributeValue.find('translateY(0px);')`, which returns 151. I did not use `index = attributeValue.find('translateY(')`, which returns -1. `sub_string` contains `'translateY(0px);'` – betahat May 13 '21 at 14:43
  • Good, so the problem is really with the last code line. If so please "import re" and try "new_y_value = re.findall('\d+', sub_string )". I will update the answer as well to be clear – Prophet May 13 '21 at 14:57
  • Thanks. `new_y_value` returns `['0']`, but that does not help me scrape the page. Could you please let me know where this code should go in the code from my question? – betahat May 13 '21 at 15:52
  • Sure. You already know how to scroll the page down, for example with `webdriver.execute_script("window.scrollTo(0, document.body.scrollHeight);")`. So just scroll the page down, get the new `new_y_value`, and compare it with its previous value, exactly as you compared `new_height` with `last_height`. – Prophet May 13 '21 at 15:57
  • sorry, it still scraped only the first 25 items. – betahat May 14 '21 at 14:45
  • Wait! As I see from your code, you first scroll until scrolling stops, and only after that, outside the `while True:` block, do you deal with the `href` links. This way you will only ever get the links present on a single screen, no more. – Prophet May 14 '21 at 15:17
  • thanks. so there is nothing I can do about it? – betahat May 14 '21 at 19:01
  • First, please let me know: when you run the current code, does the page actually scroll? – Prophet May 15 '21 at 18:24
  • nope, it is not. But anyway, thanks a lot for your help. – betahat May 16 '21 at 18:29
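
The point raised in the comments above (collect the links inside the scrolling loop rather than after it) can be sketched as follows. This is a rough outline under assumptions, not a tested solution for this site: collect_while_scrolling and its three callback parameters are hypothetical names, and the Selenium wiring shown in the comments is only one possible way to supply them.

```python
from typing import Callable, Iterable, Set


def collect_while_scrolling(
    read_position: Callable[[], object],
    scroll_down: Callable[[], None],
    harvest: Callable[[], Iterable[str]],
    max_rounds: int = 1000,
) -> Set[str]:
    """Scroll step by step and harvest after every step until the position stops changing."""
    found: Set[str] = set(harvest())  # items visible before any scrolling
    last = read_position()
    for _ in range(max_rounds):
        scroll_down()
        found.update(harvest())  # collect inside the loop, not after it
        current = read_position()
        if current == last:  # position unchanged: assume the bottom was reached
            break
        last = current
    return found


# Hypothetical wiring with Selenium 3 style calls, as used in this thread
# (thumb is the "#window-scrollbar .vertical-thumb" element, body the element receiving keys):
#   read_position = lambda: thumb.get_attribute("style")
#   scroll_down   = lambda: (body.send_keys(Keys.PAGE_DOWN), time.sleep(1))
#   harvest       = lambda: [a.get_attribute("href")
#                            for a in driver.find_elements_by_css_selector(".entity-name__name a")]
```

Because the results go into a set, links that stay visible across several scroll steps are stored only once. The max_rounds cap is a safety net in case the position never stabilizes.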
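Just use Keys.PAGE_DOWN on the page's active element:

import time

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome()

driver.get("https://startup-map.berlin/companies.startups/f/all_locations/allof_Berlin/data_type/allof_Verified")
time.sleep(3)
# Locate the scrollbar track (the result is not used below)
driver.find_element_by_css_selector("#window-scrollbar .vertical-track")

# Send PAGE_DOWN to whichever element currently has focus
a = driver.switch_to.active_element
a.send_keys(Keys.PAGE_DOWN)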

PDHide
  • Thanks. I just tried it, but it did not work either. It shows the error message "'WebElement' object has no attribute 'page_source'". I just updated the code in the question. Could you please let me know what is wrong? Thanks a lot. – betahat May 12 '21 at 22:19