
I currently have a Selenium function; here is a summary of what the code does:

def scraping_bot(selector_list):
    FOR LOOP over selector_list:  # Page A (initial), contains 12

        requests/bs4 grabs the element coordinates
        [an f-string transforms each one into a CSS selector]  # this is the list being looped through
        the selenium driver opens, detects and selects that element

        FOR LOOP over [f'string...']:  # Page B, contains 1

            driver.current_url is used to prepare the new elements to be detected
            requests/bs4 grabs the element coordinates  # this is the list being looped through
            an f-string transforms each one into a CSS selector

            the selenium driver opens, detects and selects that element
            the download begins
            sleep for 0.5 sec
            the driver goes back to the previous page

Now, my problem is that at predictable iterations, specifically when for loop B is on element 6/12 in the list, it crashes with the following error:

'//OBJECT//' is not clickable at point (591, 797). Other element would receive the click: <div style="position: relative" class="cookie-consent-inner">...</div>
  (Session info: MicrosoftEdge=...)
Stacktrace:
Backtrace:
...

Now, I don't have a problem with it doing that, but I wish it would continue to Page B 7/12 and so on, since it does have the driver.back(). Instead, the application stops.

I tried encasing the entire thing in a try/except with pass to capture this error. However, it then starts again from Page A and still misses the rest.

I would like a method where I could somehow put a 'continue' statement somewhere, but I've only started learning and I've run out of ideas. You can see in the raw code that I tried a FOR ... IF: ERROR construction in the hope of putting a pass there, but that seems to be a syntax error. See the raw code below:

import concurrent.futures
import os
import time
import requests
import re

import selenium.common.exceptions
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
import multiprocessing

edge_driver = 'C:selenium\\webdriver\\edge'
os.environ['PATH'] += edge_driver
web_links = {'digital archive': 'https://digital.nmla.metoffice.gov.uk/SO_1118bfbb-f2c9-476f-aa07-eb58b6db5ce6/', }


def scraping_bot(css_selector):
    # First stage: Years
    print('FIRST STAGE INITIATED....')

    driver = webdriver.Edge()
    driver.get(web_links.get('digital archive'))
    year_args = (By.CSS_SELECTOR, f'a[href="{css_selector}"]')
    driver.find_element(*year_args).click()

    # Second Stage: Months
    print('SECOND STAGE INITIATED....')

    sTWO_url = driver.current_url
    sTWO_site = requests.get(sTWO_url)
    sTWO_web_objects = BeautifulSoup(sTWO_site.text, 'lxml')
    monthly_placeholders = sTWO_web_objects.find(name='div', attrs={'class': 'twelve columns last results'})
    months = monthly_placeholders.find_all(name='h5')

    month_css_selector = {}
    for month_href_tags in months:
        month_tag = f'{month_href_tags.get_text()}'
        month_hrefs = re.findall(regex, str(month_href_tags))
        for month_href in month_hrefs:
            month_css_selector.update({month_tag: month_href})

    for v, y in zip(month_css_selector.values(), month_css_selector.keys()):
        print(v)  ##############################
        month_args = (By.CSS_SELECTOR, f'a[href="{v}/"]')
        driver.find_element(*month_args).click()

        # Third Stage: Download
        print(f'THIRD STAGE INITIATED for: {y}: {v}')

        sTWO_url = driver.current_url

        download_site = requests.get(sTWO_url)
        content = BeautifulSoup(download_site.text, 'lxml')
        nav_controls = content.find_all('nav')
        download_button = [nav_controls.find(attrs={'title': 'download'}) for nav_controls in nav_controls]
        download_regex = r'(?<=href=\").{1,}(?=\" title)'
        for button in download_button:
            if button is not None:
                print(button)  ##############################
                downl = re.findall(download_regex, str(button))
                if len(downl) == 1:
                    for downl_button in downl:
                        download_args = (By.CSS_SELECTOR, f'a[href="{downl_button}"]')
                        driver.find_element(*download_args).click()
                    time.sleep(2)
                    print(f'THIRD STAGE DOWNLOAD COMPLETE: {y}; {v}')

                    ##### END OF TREE HERE ####
                    driver.back()  # goes back to Second Stage and so on
                else:
                    print(f'Your download button matches exceeds 1: {len(downl)}')
        if selenium.common.exceptions.ElementClickInterceptedException:
            continue


if __name__ == '__main__':

    sONE_url = requests.get(web_links.get('digital archive'))
    sONE_web_objects = BeautifulSoup(sONE_url.text, 'lxml')

    year_placeholder = sONE_web_objects.find(name='div', attrs={'class': 'sixteen columns results-and-filters'})
    years = year_placeholder.find_all(name='div', attrs={'class': ['one_sixth grey_block new-secondary-background result-item',
                                                                   'one_sixth grey_block new-secondary-background result-item last']})  # don't skip, needed for titles.
    unit = [years.find('h5') for years in years]
    regex = r'(?<=href=\").{1,}(?=\/")'  # lookaround = PositiveLookBehind...PositiveLookAhead

    year_css_selector = []

    titles = [years.get('title') for years in years]
    for year_href_tags, year_tag in zip(unit, titles):  # href_tag -> bs4 component
        hrefs = re.findall(regex, str(year_href_tags.get_text))  # href_tag.get_text -> method that enables str.
        for year_href in hrefs:
            year_css_selector.append(f'{year_href}/')

    for i in year_css_selector:
        scraping_bot(i)

Thus, I wish my code would simply pass or continue past the erroneous web page, which I can then download manually myself.

Human006
  • You have some indentation errors in your "raw code". I want to make sure I understand: You have two nested loops. The outer loop begins `for v, y in zip(month_css_selector.values()...` and the inner loop begins `for button in download_button:`. Are you getting an exception in the inner loop, and you want to continue with the next iteration of the inner loop? It would help if you fixed your indentation and inserted `try/catch` with comments as to where you want to resume. Also, multiprocessing is not appropriate since each driver is already a process. Multithreading would ultimately be a better choice. – Booboo Jan 10 '22 at 11:58
  • But using a multithreading pool is tricky because you will either needlessly be creating and destroying too many driver processes instead of reusing them (otherwise, what is the point of using a pool rather than individual `Thread` instances?) or you will have to figure out a way of reusing a pool of N threads to process M tasks where M > N while leaving the N drivers open. See [this](https://stackoverflow.com/questions/53475578/python-selenium-multiprocessing#64513719) when you are ready. – Booboo Jan 10 '22 at 12:09
  • @Booboo, the indents are perfect as far as I am aware. I used a try and except before, but this stops the application, starts again on the `if __name__ == '__main__'` for loop, and misses completing the 'inner' and 'outer' loops. I believe my question is far too complex and long now, so I shall take your advice about multithreading, as well as the answer. – Human006 Jan 10 '22 at 14:41
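For later reference, here is a minimal sketch of the reuse pattern Booboo describes in the comment above: N worker threads, each lazily creating one driver and reusing it across its tasks, with every driver quit at the end. The `process` body, the `max_workers` value, and the use of `year_css_selector` from the question's `__main__` block are placeholder assumptions, not tested code:

import threading
from concurrent.futures import ThreadPoolExecutor

from selenium import webdriver

thread_local = threading.local()
all_drivers = []  # kept so every driver can be quit at the end


def get_driver():
    # Lazily create one driver per worker thread and reuse it for
    # every task that thread picks up.
    if not hasattr(thread_local, 'driver'):
        thread_local.driver = webdriver.Edge()
        all_drivers.append(thread_local.driver)
    return thread_local.driver


def process(css_selector):
    driver = get_driver()
    # ... scrape one year with this driver, as scraping_bot does ...


with ThreadPoolExecutor(max_workers=4) as executor:
    executor.map(process, year_css_selector)

for driver in all_drivers:
    driver.quit()

This keeps at most max_workers browser processes alive however many selectors are queued, instead of opening a fresh browser per selector as the current scraping_bot does.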

2 Answers


If I understand your issue, I think you just need to put a try/catch in the right place, namely surrounding all the code within the `for v, y in zip(month_css_selector.values(), ...):` block in function scraping_bot:

import concurrent.futures
import os
import time
import requests
import re

import selenium.common.exceptions
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
import multiprocessing

edge_driver = 'C:selenium\\webdriver\\edge'
os.environ['PATH'] += edge_driver
web_links = {'digital archive': 'https://digital.nmla.metoffice.gov.uk/SO_1118bfbb-f2c9-476f-aa07-eb58b6db5ce6/', }


def scraping_bot(css_selector):
    # First stage: Years
    print('FIRST STAGE INITIATED....')

    driver = webdriver.Edge()
    driver.get(web_links.get('digital archive'))
    year_args = (By.CSS_SELECTOR, f'a[href="{css_selector}"]')
    driver.find_element(*year_args).click()

    # Second Stage: Months
    print('SECOND STAGE INITIATED....')

    sTWO_url = driver.current_url
    sTWO_site = requests.get(sTWO_url)
    sTWO_web_objects = BeautifulSoup(sTWO_site.text, 'lxml')
    monthly_placeholders = sTWO_web_objects.find(name='div', attrs={'class': 'twelve columns last results'})
    months = monthly_placeholders.find_all(name='h5')

    month_css_selector = {}
    for month_href_tags in months:
        month_tag = f'{month_href_tags.get_text()}'
        month_hrefs = re.findall(regex, str(month_href_tags))
        for month_href in month_hrefs:
            month_css_selector.update({month_tag: month_href})

    for v, y in zip(month_css_selector.values(), month_css_selector.keys()):
        try:
            print(v)  ##############################
            month_args = (By.CSS_SELECTOR, f'a[href="{v}/"]')
            driver.find_element(*month_args).click()
    
            # Third Stage: Download
            print(f'THIRD STAGE INITIATED for: {y}: {v}')
    
            sTWO_url = driver.current_url
    
            download_site = requests.get(sTWO_url)
            content = BeautifulSoup(download_site.text, 'lxml')
            nav_controls = content.find_all('nav')
            download_button = [nav_controls.find(attrs={'title': 'download'}) for nav_controls in nav_controls]
            download_regex = r'(?<=href=\").{1,}(?=\" title)'
            for button in download_button:
                if button is not None:
                    print(button)  ##############################
                    downl = re.findall(download_regex, str(button))
                    if len(downl) == 1:
                        for downl_button in downl:
                            download_args = (By.CSS_SELECTOR, f'a[href="{downl_button}"]')
                            driver.find_element(*download_args).click()
                        time.sleep(2)
                        print(f'THIRD STAGE DOWNLOAD COMPLETE: {y}; {v}')
    
                        ##### END OF TREE HERE ####
                        driver.back()  # goes back to Second Stage and so on
                    else:
                        print(f'Your download button matches exceeds 1: {len(downl)}')
        except selenium.common.exceptions.ElementClickInterceptedException:
            # This is sort of expected:
            pass
        except Exception as e:
            # If it is something else, print it out:
            print('Got exception:', e)


if __name__ == '__main__':

    sONE_url = requests.get(web_links.get('digital archive'))
    sONE_web_objects = BeautifulSoup(sONE_url.text, 'lxml')

    year_placeholder = sONE_web_objects.find(name='div', attrs={'class': 'sixteen columns results-and-filters'})
    years = year_placeholder.find_all(name='div', attrs={'class': ['one_sixth grey_block new-secondary-background result-item',
                                                                   'one_sixth grey_block new-secondary-background result-item last']})  # don't skip, needed for titles.
    unit = [years.find('h5') for years in years]
    regex = r'(?<=href=\").{1,}(?=\/")'  # lookaround = PositiveLookBehind...PositiveLookAhead

    year_css_selector = []

    titles = [years.get('title') for years in years]
    for year_href_tags, year_tag in zip(unit, titles):  # href_tag -> bs4 component
        hrefs = re.findall(regex, str(year_href_tags.get_text))  # href_tag.get_text -> method that enables str.
        for year_href in hrefs:
            year_css_selector.append(f'{year_href}/')

    for i in year_css_selector:
        scraping_bot(i)
Booboo
  • Unfortunately, catching the exception still aborts the current driver and starts again at the beginning of the for loop in `if __name__ == '__main__'`. The pass statement becomes unreachable code. – Human006 Jan 10 '22 at 17:51
  • I tried it with Chrome (i.e. a ChromeDriver) and it seemed to work. I saw the one expected exception message printed out and it continued to download PDF files. I then broke out of it after a while. – Booboo Jan 10 '22 at 17:58
  • Perhaps Edge just behaves differently? I shall switch to a ChromeDriver instead. I have accepted this as the answer until I find out exactly why it is behaving differently for my setup. – Human006 Jan 10 '22 at 18:10
  • I should have mentioned that I had tested with Chrome, which is the only driver I had ever downloaded. It never occurred to me that it made a difference. – Booboo Jan 10 '22 at 18:18
  • I should add that you should include a call to `driver.quit()` when you are all done so as to not leave any process hanging around. – Booboo Jan 10 '22 at 18:19
  • Could you please advise which chromedriver version to use (preferably the one you used to test), because it's not trivial to know which chromedriver version would work for an Edge browser. – Human006 Jan 10 '22 at 18:27
  • You have to load the version that matches your version of Chrome. When I do Help > About Google Chrome (first select the 3 vertical dots in the upper right portion of the window) I see that I am running version 97.0.4692.71. So at https://chromedriver.chromium.org/downloads I select the corresponding link, i.e. https://chromedriver.storage.googleapis.com/index.html?path=97.0.4692.71/ – Booboo Jan 10 '22 at 18:33
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/240934/discussion-between-human006-and-booboo). – Human006 Jan 10 '22 at 19:00

UPDATE to ANSWER

I found that non-clickable errors come down to one of two things:

  1. The browsers that Selenium opens are not your day-to-day browser. This is an important demarcation when debugging errors. These Selenium browsers start from a clean profile and do not carry over the cookie preferences or manual settings your day-to-day browser has, so web pages open with small differences such as a cookie pop-up or different zoom properties. It follows that a script built against paths in your usual browser may not work in the default browser Selenium launches (a sketch for dismissing such a pop-up follows the code below).

  2. When clicking elements, even if Selenium detects them with no problem, note that a click is literally like clicking manually: if the element is not visible in the viewport, the click will fail. This is easily fixed by moving to the element or, even better, by scrolling it into view (see below):

This:

first_element = driver.find_element(By.CSS_SELECTOR, f'a[href="{urls}"]') 
driver.execute_script("arguments[0].scrollIntoView();", first_element) ##
first_element.click()

even better, this:

from selenium.webdriver.common.action_chains import ActionChains

first_element = driver.find_element(By.CSS_SELECTOR, f'a[href="{urls}"]')
ActionChains(driver).move_to_element(first_element).perform()  ##
first_element.click()
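Regarding the first point, here is a minimal sketch of dismissing the cookie pop-up once, right after the driver opens the page and before any click loop runs. The `.cookie-consent-inner` class is taken from the error message in the question; the `button` selector inside it is an assumption and should be checked against the actual banner markup:

from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException


def dismiss_cookie_banner(driver):
    try:
        # The inner selector is hypothetical; inspect the real pop-up.
        driver.find_element(By.CSS_SELECTOR, '.cookie-consent-inner button').click()
    except NoSuchElementException:
        pass  # no banner present, nothing to dismiss

Calling this once after driver.get(...) means the banner can no longer sit on top of the element the script actually wants to click.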

In light of how useful this post might be for those looking to solve a similar problem, @Booboo's answer is helpful in most cases. However, regarding the problem of Selenium drivers and for loops, I found that indentation (as mentioned) and my regex arguments were the culprit.

Specifically, my regex did not catch URLs in two scenarios:

  1. where the tag is:

    link rel="alternate" type="application/rss+xml" title="Met Office UA » DWS_2003_06 Comments Feed" href="https://digital.nmla.metoffice.gov.uk/IO_e273bcd1-7131-482d-aec0-04755809ec3a/feed/"

  2. where there are additional elements:

    a class="new-primary new-primary-tint-hover fa fa-download" href="https://digital.nmla.metoffice.gov.uk/download/file/IO_efa3ef81-4812-4c8e-a4ab-055b147644d2" title="download"

I found that simply changing the download-button regex to include an 'OR' alternative fixed this situation:

(?<=href=").{1,}(?=" title|/">)

instead of

(?<=href=").{1,}(?=" title)

...obviously along with the answer posted above.
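The change is easy to sanity-check with re.findall. The tag strings below are shortened reconstructions of the two snippets quoted above, not exact page source:

import re

old_pattern = r'(?<=href=").{1,}(?=" title)'
new_pattern = r'(?<=href=").{1,}(?=" title|/">)'

# Shortened reconstructions of the two scenarios:
feed_tag = '<link rel="alternate" title="Comments Feed" href="https://digital.nmla.metoffice.gov.uk/IO_e273bcd1-7131-482d-aec0-04755809ec3a/feed/">'
button_tag = '<a class="fa fa-download" href="https://digital.nmla.metoffice.gov.uk/download/file/IO_efa3ef81-4812-4c8e-a4ab-055b147644d2" title="download">'

print(re.findall(old_pattern, feed_tag))    # [] -- the old pattern misses the trailing-slash case
print(re.findall(new_pattern, feed_tag))    # now catches the IO_... link
print(re.findall(new_pattern, button_tag))  # still catches the download link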

Human006