
I'm new to Selenium and still have the question below after searching for solutions.

I am trying to access all the links on this website (https://www.ecb.europa.eu/press/pressconf/html/index.en.html).

The individual links are loaded lazily: they appear gradually as the user scrolls down the page.

import re
import time

from selenium import webdriver

driver = webdriver.Chrome("chromedriver.exe")
driver.get("https://www.ecb.europa.eu/press/pressconf/html/index.en.html")

# scrolling: keep jumping to the bottom until the page height stops changing
lastHeight = driver.execute_script("return document.body.scrollHeight")
#print(lastHeight)

pause = 0.5
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(pause)
    newHeight = driver.execute_script("return document.body.scrollHeight")
    if newHeight == lastHeight:
        break
    lastHeight = newHeight
    print(lastHeight)

# ---

# collect every href that looks like a press-conference statement
elems = driver.find_elements_by_xpath("//a[@href]")
for elem in elems:
    url = elem.get_attribute("href")
    if re.search(r'is\d+\.en\.html', url):
        print(url)

However, it only gets the required links from the last lazy-loaded block; everything before it is not obtained because those elements are not loaded.

I want to make sure all lazy-loaded elements have loaded before executing any scraping code. How can I do that?

Many thanks

Felton Wang
  • You can use Selenium to simulate scrolling all the way to the bottom and then scrape the results. Simulating scrolling is a bit challenging though. https://stackoverflow.com/questions/20986631/how-can-i-scroll-a-web-page-using-selenium-webdriver-in-python – Chris Jun 26 '21 at 12:47
  • What is the requirement: scroll down and grab links (till it reaches the bottom), right? – cruisepandey Jun 26 '21 at 12:52
  • FYI it's scraping, not scrapping. Scrapping (and 'to scrap') means throwing away like rubbish. – DisappointedByUnaccountableMod Jun 26 '21 at 14:05
  • @ChristopherHolder, yes, I tried scrolling to the bottom, but simply doing that will not load all the lazy-load components; the ones towards the top and middle of the page remain unloaded. – Felton Wang Jun 27 '21 at 07:56
  • @cruisepandey, grab all the links, but simply scrolling all the way down to the bottom will not load all the "lazy-load" components. I think I'll need to scroll down little by little and scrape along the way (see the sketch below these comments). – Felton Wang Jun 27 '21 at 07:58
  • @barny okay, thanks. – Felton Wang Jun 27 '21 at 07:58
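
A rough sketch of that "scroll a little, collect along the way" idea, reusing the same Selenium 3-era calls and driver setup as the question. The step size, pause and the found set are arbitrary choices here, not a verified fix:

import re
import time

from selenium import webdriver

driver = webdriver.Chrome("chromedriver.exe")
driver.get("https://www.ecb.europa.eu/press/pressconf/html/index.en.html")

found = set()
position = 0
step = 400      # pixels per scroll step (arbitrary)
pause = 0.5

height = driver.execute_script("return document.body.scrollHeight")
while position < height:
    position += step
    driver.execute_script("window.scrollTo(0, arguments[0]);", position)
    time.sleep(pause)
    # collect whatever anchors are currently rendered
    for elem in driver.find_elements_by_xpath("//a[@href]"):
        url = elem.get_attribute("href")
        if url and re.search(r'is\d+\.en\.html', url):
            found.add(url)
    # the total height grows as further years are lazy-loaded
    height = driver.execute_script("return document.body.scrollHeight")

print(len(found))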

1 Answer


Selenium was not designed for web scraping (although it can be useful in complicated cases). In your case, press F12 -> Network and watch the XHR tab while you scroll down the page. You can see that the requests being added contain the year in their URLs: the page fires a sub-request each time you scroll far enough down to reach another year.

Look at the Response tab to find the divs and classes and build the BeautifulSoup find_all calls. A simple loop over the years with requests and bs4 is enough:

import requests as rq
from bs4 import BeautifulSoup as bs


headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:90.0) Gecko/20100101 Firefox/90.0"}

resultats = []

for year in range(1998, 2021 + 1):

    url = "https://www.ecb.europa.eu/press/pressconf/%s/html/index_include.en.html" % year
    resp = rq.get(url, headers=headers)
    soup = bs(resp.content, "html.parser")

    titles = map(lambda x: x.text, soup.find_all("div", {"class": "title"}))
    subtitles = map(lambda x: x.text, soup.find_all("div", {"class": "subtitle"}))
    dates = map(lambda x: x.text, soup.find_all("dt"))

    zipped = list(zip(dates, titles, subtitles))
    resultats.extend(zipped)

resultats contains:

...
('8 November 2012',
  'Mario Draghi, Vítor Constâncio:\xa0Introductory statement to the press conference (with Q&A)',
  'Mario Draghi,  President of the ECB,  Vítor Constâncio,  Vice-President of the ECB,  Frankfurt am Main,  8 November 2012'),
 ('4 October 2012',
  'Mario Draghi, Vítor Constâncio:\xa0Introductory statement to the press conference (with Q&A)',
  'Mario Draghi,  President of the ECB,  Vítor Constâncio,  Vice-President of the ECB,  Brdo pri Kranju,  4 October 2012'),
...
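
Since the question asked for the links themselves (the is...en.html pages), the same per-year index_include pages can also be mined for hrefs. This is a sketch under the same assumptions as above (same headers and year range); urljoin is used in case the hrefs are relative:

import re
from urllib.parse import urljoin

import requests as rq
from bs4 import BeautifulSoup as bs

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:90.0) Gecko/20100101 Firefox/90.0"}

links = []
for year in range(1998, 2021 + 1):
    url = "https://www.ecb.europa.eu/press/pressconf/%s/html/index_include.en.html" % year
    soup = bs(rq.get(url, headers=headers).content, "html.parser")
    for a in soup.find_all("a", href=True):
        if re.search(r'is\d+\.en\.html', a["href"]):
            # resolve possibly relative hrefs against the include page's URL
            links.append(urljoin(url, a["href"]))

print(len(links))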
ce.teuf