
I have a program that downloads photos from various websites. Each URL is built by appending codes to the end of a base address; the codes come from a dataframe with 8,583 rows.

The sites use JavaScript, so I use Selenium to get the src of each photo, and then download it with urllib.request.urlretrieve.

Example of a photo site: http://divulgacandcontas.tse.jus.br/divulga/#/candidato/2018/2022802018/PB/150000608817

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
from bs4 import BeautifulSoup
import time
import urllib.request, urllib.parse, urllib.error

# Root URL of the site that is accessed to fetch the photo link
url_raiz = 'http://divulgacandcontas.tse.jus.br/divulga/#/candidato/2018/2022802018/'

# Accesses the dataframe that has the "sequencial" type codes
candidatos = pd.read_excel('candidatos_2018.xlsx',sheet_name='Sheet1', converters={'sequencial': lambda x: str(x), 'cpf': lambda x: str(x),'numero_urna': lambda x: str(x)})

# Function that opens each page and takes the link from the photo
def pegalink(url):
    profile = webdriver.FirefoxProfile()
    browser = webdriver.Firefox(profile)

    browser.get(url)
    time.sleep(10)

    html = browser.page_source
    soup = BeautifulSoup(html, "html.parser")
    browser.close()

    link = soup.find("img", {"class": "img-thumbnail img-responsive dvg-cand-foto"})['src']

    return link

# Function that downloads the photo and saves it with the code name "cpf"
def baixa_foto(nome, url):
    urllib.request.urlretrieve(url, nome)


# Iteration in the dataframe
for num, row in candidatos.iterrows():
    cpf = (row['cpf']).strip()
    uf = (row['uf']).strip()
    print(cpf)
    print("-/-")
    sequencial = (row['sequencial']).strip()

    # Creates full page address
    url = url_raiz + uf + '/' + sequencial

    link_foto = pegalink(url)

    baixa_foto(cpf, link_foto)

I am looking for guidance on:

  • Putting a try/except in place to wait for the page to load (I am getting errors reading the src; after many requests the site takes more than ten seconds to load)

  • Recording all errors, in a file or dataframe, noting the "sequencial" code that failed so the program can continue (roughly like the sketch below)
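
Roughly what I have in mind for the second point is something like this sketch (I have not gotten it to work together with the waiting yet; the file name erros_2018.csv is just a placeholder):

# Sketch only: collect the "sequencial" codes that fail and keep iterating
erros = []

for num, row in candidatos.iterrows():
    cpf = row['cpf'].strip()
    uf = row['uf'].strip()
    sequencial = row['sequencial'].strip()
    url = url_raiz + uf + '/' + sequencial

    try:
        link_foto = pegalink(url)
        baixa_foto(cpf, link_foto)
    except Exception as e:
        # Note the failing code and continue with the next row
        erros.append({'sequencial': sequencial, 'uf': uf, 'erro': str(e)})
        continue

# At the end, save every failure to a file
pd.DataFrame(erros).to_csv('erros_2018.csv', index=False)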

Would anyone know how to do this? The guidance below was very useful, but I was unable to move forward.

I put part of the data I use and the program in a folder, if you want to take a look: https://drive.google.com/drive/folders/1lAnODBgC5ZUDINzGWMcvXKTzU7tVZXsj?usp=sharing

Reinaldo Chaves

1 Answer


Put your code inside a try/except with an explicit wait:

    try:
        WebDriverWait(browser, 30).until(page_has_loaded)
        # here goes your code
    except Exception:
        print("This is an unexpected condition!")

where page_has_loaded is:

def page_has_loaded(driver):
    # WebDriverWait calls this with the driver; True once the DOM is ready
    page_state = driver.execute_script('return document.readyState;')
    return page_state == 'complete'

The 30 above is the timeout in seconds; adjust it as needed.
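
If waiting for document.readyState is not enough (the photo can be injected by JavaScript after the document is already "complete"), you can also wait explicitly for the image element itself. The sketch below only reuses the selector and Firefox setup from the question; the 30-second timeout is arbitrary:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def pegalink(url):
    profile = webdriver.FirefoxProfile()
    browser = webdriver.Firefox(profile)
    browser.get(url)

    try:
        # Wait up to 30 seconds for the candidate photo to be present in the DOM
        img = WebDriverWait(browser, 30).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "img.dvg-cand-foto"))
        )
        link = img.get_attribute("src")
    except Exception:
        print("Erro em: ", url)
        link = "Erro"
    finally:
        browser.quit()

    return link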

Approach 2:

class wait_for_page_load(object):
    # Context manager: remembers the <html> element on entry and, on exit,
    # waits until a different <html> element has replaced it (i.e. a new
    # page has loaded).

    def __init__(self, browser):
        self.browser = browser

    def __enter__(self):
        self.old_page = self.browser.find_element_by_tag_name('html')

    def page_has_loaded(self):
        new_page = self.browser.find_element_by_tag_name('html')
        return new_page.id != self.old_page.id

    def __exit__(self, *_):
        wait_for(self.page_has_loaded)
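
The __exit__ above assumes a wait_for helper that is not shown; a minimal polling version (the 30-second timeout and 0.5-second poll interval are just assumptions) could be:

import time

def wait_for(condition_function, timeout=30):
    # Poll the condition until it returns True or the timeout expires
    start_time = time.time()
    while time.time() - start_time < timeout:
        if condition_function():
            return True
        time.sleep(0.5)
    raise Exception('Timed out waiting for {}'.format(condition_function.__name__))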


def pegalink(url):
    profile = webdriver.FirefoxProfile()
    browser = webdriver.Firefox(profile)

    browser.get(url)

    try:
        with wait_for_page_load(browser):
            html = browser.page_source
            soup = BeautifulSoup(html, "html.parser")
            link = soup.find("img", {"class": "img-thumbnail img-responsive dvg-cand-foto"})['src']

    except Exception:
        print("This is an unexpected condition!")
        print("Erro em: ", url)
        link = "Erro"

    # Close the browser only after the wait and the scraping are done
    browser.close()
    return link
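
Because this version of pegalink catches the exception itself and returns the string "Erro", the calling loop has to check the return value instead of relying on an exception; a short sketch (erros.csv is only an example file name):

erros = []

for num, row in candidatos.iterrows():
    cpf = row['cpf'].strip()
    uf = row['uf'].strip()
    sequencial = row['sequencial'].strip()

    link_foto = pegalink(url_raiz + uf + '/' + sequencial)

    if link_foto == "Erro":
        # Record the failing "sequencial" code and move on
        erros.append(sequencial)
        continue

    baixa_foto(cpf, link_foto)

pd.DataFrame({'sequencial': erros}).to_csv('erros.csv', index=False)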
Abhishek_Mishra
  • Thank you. I added more detail above about the error. I have not found a working try yet. Is there any way to make the code always wait for the page to be fully loaded? – Reinaldo Chaves Aug 23 '18 at 23:02
  • In that case you should use explicit page waiting. Updated my answer. – Abhishek_Mishra Aug 24 '18 at 10:58
  • Thank you very much. I modified the code above, but now the program only opens the links; apparently it does not run the lines below the "WebDriverWait" and does not save the images. Please, did I do something wrong? – Reinaldo Chaves Aug 24 '18 at 12:14
  • Yes, I understand; multiple executions of the same code can sometimes lead to problems, but I am not sure why it didn't work at all. Anyway, please try the updated one and let me know. – Abhishek_Mishra Aug 24 '18 at 12:37
  • Thank you. But now the error appears every time: This is an unexpected condition! Erro em: http://divulgacandcontas.tse.jus.br/divulga/#/candidato/2018/2022802018/MA/100000601895 Erro em: 33489874315 – Reinaldo Chaves Aug 24 '18 at 13:10
  • I put part of the data I use and the program in a folder, if you want to take a look: https://drive.google.com/drive/folders/1lAnODBgC5ZUDINzGWMcvXKTzU7tVZXsj?usp=sharing – Reinaldo Chaves Aug 24 '18 at 13:12