
I have tried all the solutions from this very similar post, but unfortunately I do not get any helpful error, nor do I get any PDF files in my folder.

To configure Selenium to run headless and download to a directory of my choosing, I followed this post and this.

However, I don't see anything. The behaviour also differs between executing interactively and running a script: when executing interactively I don't see any error, but nothing happens either; when running a script I get a not-so-useful error:

WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, f"a[href*={css_selector}']"))).click()
  File "C----\selenium\webdriver\support\wait.py", line 80, in until
    raise TimeoutException(message, screen, stacktrace)
selenium.common.exceptions.TimeoutException: Message:

The website in question is here.

The code that I am trying to get working is:

import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
options = Options()
options.headless = True

uri = "http://affidavitarchive.nic.in/CANDIDATEAFFIDAVIT.aspx?YEARID=March-2017+(+GEN+)&AC_No=1&st_code=S24&constType=AC"

driver = webdriver.Firefox(options=options, executable_path=r'C:\\Users\\xxx\\geckodriver.exe')

profile = webdriver.FirefoxProfile()
profile.set_preference('browser.download.folderList', 2) # custom location
profile.set_preference('browser.download.manager.showWhenStarting', False)
profile.set_preference('browser.download.dir', r'C:\\Users\\xxx\\Downloads')
profile.set_preference('browser.helperApps.neverAsk.saveToDisk', 'application/pdf')

# Function that reads the table in the webpage and extracts the links for the pdfs
def get_links_from_table(uri):
    html = requests.get(uri)
    soup = BeautifulSoup(html.content, 'lxml')
    table = soup.find_all('table')[-1]
    candidate_affidavit_links = []
    for link in table.find_all('a'):
        candidate_affidavit_links.append(link.get('href'))
    return candidate_affidavit_links

candidate_affidavit_links_list = get_links_from_table(uri)

driver.get(uri)

# iterate over the javascript links and try to download the pdf files
for js_link in candidate_affidavit_links_list:
    css_selector = js_link.split("'")[1]
    WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, f"a[href*={css_selector}']"))).click()
    driver.execute_script(js_link)
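As an aside, the `js_link.split("'")[1]` step is easier to sanity-check when isolated into a small helper. The href below is only illustrative of the `__doPostBack`-style links this kind of ASP.NET page emits, not copied from the site:

```python
def postback_target(js_link):
    """Extract the control ID sitting between the first pair of single
    quotes in a javascript:__doPostBack('...','') href."""
    return js_link.split("'")[1]

# Illustrative __doPostBack-style href (not taken verbatim from the site)
link = "javascript:__doPostBack('ctl00$GridView1$ctl02$lnkView','')"
print(postback_target(link))  # ctl00$GridView1$ctl02$lnkView
```

Printing the result for each harvested link makes it obvious what the CSS selector will be asked to match.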
  • I'm not very familiar with BeautifulSoup, but maybe you need to put some kind of wait/delay inside the `get_links_from_table` method to let the data load, similar to what we do in Selenium? A sleep after `html = requests.get(uri)` and before `soup = BeautifulSoup(html.content, 'lxml')`? Or maybe a line after that? – Prophet Jun 27 '21 at 14:37
  • @Prophet I am not so sure about that. If you inspect the webpage, it's quite lightweight and the PDF links are always JavaScript. You can try printing `candidate_affidavit_links_list` and you'd see that the links have been harvested successfully, so I don't think that is the issue. But I really don't know, to be honest. – jar Jun 27 '21 at 14:40
  • Again, I don't know how it works with BeautifulSoup, but with Selenium any page change/load takes much more time than code execution, so we have to use some kind of wait at every step where the page changes. – Prophet Jun 27 '21 at 14:45
  • I call `driver.get(uri)` once, and then in the second-to-last line you can see I have `WebDriverWait(driver, 20)......` - is that a 20-second wait? Do you want me to increase it and try? – jar Jun 27 '21 at 14:49
  • No, no need. Inside the `for js_link in candidate_affidavit_links_list:` loop you are waiting for some elements to be clickable, but I'm afraid the elements list is empty, since when you read them the page has still not loaded. Or something like that. – Prophet Jun 27 '21 at 14:54
  • I did `print(candidate_affidavit_links_list)` and I see all the elements there. Am I missing something silly here... sorry to bother you so much... it's my first time with Selenium. – jar Jun 27 '21 at 15:02
  • It's definitely OK. I see it as a kind of funny quiz: try to solve a problem with a lot of missing information :) – Prophet Jun 27 '21 at 15:10
  • I do not understand. Inside `get_links_from_table` you are getting a list of `a` elements. Why not simply return that list of `a` elements and then click them one by one? – Prophet Jun 27 '21 at 15:13
  • Also, since you already have `candidate_affidavit_links_list`, why apply `js_link.split("'")[1]` to every one of them? – Prophet Jun 27 '21 at 15:16
  • Maybe instead of `By.CSS_SELECTOR, f"a[href*={css_selector}']"` you can use `By.CSS_SELECTOR, f"a[href*='{js_link}']"`, and the **MAIN** question: maybe you are simply missing a `'` there, as I used in the former code: `f"a[href*='{js_link}']"`? – Prophet Jun 27 '21 at 15:19
  • Yes to the previous comment, I probably should have done that. For the second one, I was looking at the links I had shared, and if you look at the CSS selector it looks like it's the string in the middle of the link that we get. So I use split just to get the middle substring, then find the element using that and click on it. – jar Jun 27 '21 at 15:20
  • I tried your `By.CSS_SELECTOR, f"a[href*='{js_link}']"` and I get the exact same error. I also fixed the `'` I missed in my original CSS selector - now I don't get any error, but I don't see the downloaded files. – jar Jun 27 '21 at 15:22
  • Can all this be done with Selenium, or do you have to get the data with BeautifulSoup first? – Prophet Jun 27 '21 at 15:28
  • I think it can all be done with just Selenium. I just use BeautifulSoup to extract the JavaScript href links... BeautifulSoup serves no other purpose than that. – jar Jun 27 '21 at 15:29

2 Answers


If all this can be done with Selenium, I would try this:

import time

driver.get(uri)
WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "(//table//a)[last()]")))
time.sleep(1)
candidate_affidavit_links = driver.find_elements_by_xpath("//table//a")
for link in candidate_affidavit_links:
    link.click()
    time.sleep(1)

Open the page, wait until the last link in the table is visible, add a little extra wait until the whole table has surely loaded, collect all the `a` (link) elements into a list, then iterate through that list, clicking each element and pausing after each click so the download can complete.
You may possibly need a longer delay after clicking each link, so that one file finishes downloading before the next download starts.
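Instead of tuning a fixed delay, one option (my addition, not part of the original answer) is to poll the download directory until Firefox's in-progress `.part` files are gone. The directory path and timeout here are placeholders:

```python
import glob
import os
import time

def wait_for_downloads(download_dir, timeout=60):
    """Return True once no Firefox .part files remain in download_dir,
    or False if the timeout expires first."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        # Firefox writes in-progress downloads as <name>.part
        if not glob.glob(os.path.join(download_dir, '*.part')):
            return True
        time.sleep(0.5)
    return False
```

Called after each `link.click()`, this blocks just as long as the current file needs, instead of guessing a delay.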
UPD
To disable the pop-ups asking where to save the file etc., try this: instead of just

profile.set_preference('browser.helperApps.neverAsk.saveToDisk', 'application/pdf')

put this:

profile.set_preference('browser.helperApps.neverAsk.saveToDisk', 'application/csv,application/excel,application/vnd.ms-excel,application/vnd.msexcel,text/anytext,text/comma-separated-values,text/csv,text/plain,text/x-csv,application/x-csv,text/x-comma-separated-values,text/tab-separated-values,data:text/csv')
profile.set_preference('browser.helperApps.neverAsk.saveToDisk', 'application/xml,text/plain,text/xml,image/jpeg,application/octet-stream,data:text/csv')
profile.set_preference('browser.download.manager.showWhenStarting', False)
profile.set_preference('browser.helperApps.neverAsk.openFile', 'application/csv,application/excel,application/vnd.ms-excel,application/vnd.msexcel,text/anytext,text/comma-separated-values,text/csv,text/plain,text/x-csv,application/x-csv,text/x-comma-separated-values,text/tab-separated-values,data:text/csv')
profile.set_preference('browser.helperApps.neverAsk.openFile', 'application/xml,text/plain,text/xml,image/jpeg,application/octet-stream,data:text/csv')
profile.set_preference('browser.helperApps.alwaysAsk.force', False)
profile.set_preference('browser.download.useDownloadDir', True)
profile.set_preference('dom.file.createInChild', True)

Not sure you need all of this, but I have it all and it works for me.
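One thing worth double-checking against the question's code: preferences only take effect if the profile is actually handed to the driver when it is constructed. In the question, the `FirefoxProfile` is created after `webdriver.Firefox(...)` and never passed in, so all of its preferences are silently ignored. A configuration sketch (paths are placeholders; `pdfjs.disabled` is an extra preference I would add so Firefox saves PDFs instead of opening them in its built-in viewer):

```python
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

options = Options()
options.headless = True

profile = webdriver.FirefoxProfile()
profile.set_preference('browser.download.folderList', 2)  # use a custom location
profile.set_preference('browser.download.dir', r'C:\Users\xxx\Downloads')
profile.set_preference('browser.helperApps.neverAsk.saveToDisk', 'application/pdf')
profile.set_preference('pdfjs.disabled', True)  # don't open PDFs in the built-in viewer

# The profile must be passed at construction time, together with the options
driver = webdriver.Firefox(firefox_profile=profile, options=options,
                           executable_path=r'C:\Users\xxx\geckodriver.exe')
```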

  • I get this error - `AttributeError: module 'selenium.webdriver.support.expected_conditions' has no attribute 'element_to_be_visible'` . Replaced it with `element_to_be_clickable` - I don't see any error but then I don't see any file as well. I tried to print `link` but I get only 1 output instead of 13 for that url. – jar Jun 27 '21 at 16:09
  • 1
    You are right about `element_to_be_visible`. As about printing the `link` each link should contain single element, while `candidate_affidavit_links` should contain 13 elements – Prophet Jun 27 '21 at 17:10
  • Actually the list has only one element. This is the output for the list as well as the link in the list - ```[] ``` – jar Jun 27 '21 at 19:01
  • Ah I removed `[last()]` from `candidate_affidavit_links` now the list has 13 elements. But why don't I see any PDFs in my Downloads folder? – jar Jun 27 '21 at 19:03
  • I can't see it from here... But if it works - it works. Do you see, visually, while the test runs, that files are being downloaded? – Prophet Jun 27 '21 at 19:07
  • Nope I don't see anything – jar Jun 27 '21 at 19:08
  • That's strange... I can't run this on my PC, but it should work. – Prophet Jun 27 '21 at 19:13
  • I removed the headless stuff and now I can see the browser starting, and the download PDF pop-ups are appearing. But I don't want to click on Save File -> OK all the time... there are more than 400 pages. How do I make your exact code work headlessly? The configuration I am using for headless mode is in my original code. – jar Jun 27 '21 at 19:20
  • I am close... but not quite there yet. The pop-ups come and I have to save them manually. Not sure why it doesn't save automatically. – jar Jun 27 '21 at 19:45
  • Tried this - https://stackoverflow.com/questions/50321278/how-to-load-firefox-profile-with-python-selenium as well as this - https://stackoverflow.com/questions/41644381/python-set-firefox-preferences-for-selenium-download-location - the pop-ups are still coming – jar Jun 27 '21 at 19:51

This is much simpler in Chrome:

driver = webdriver.Chrome()

driver.execute_cdp_cmd("Page.setDownloadBehavior", {"behavior": "allow", "downloadPath": "/path/to/folder"})

driver.get("http://affidavitarchive.nic.in/CANDIDATEAFFIDAVIT.aspx?YEARID=March-2017+(+GEN+)&AC_No=1&st_code=S24&constType=AC")

for a in driver.find_elements_by_css_selector('a[href*=doPostBack]'):
    a.click()
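Whichever browser you use, it helps to verify that the clicks actually produced files. A small helper I would add (the download directory is a placeholder; Chrome's in-progress files end in `.crdownload`, so counting only `*.pdf` skips unfinished downloads):

```python
import glob
import os

def count_pdfs(download_dir):
    """Count completed PDF files in the download directory."""
    return len(glob.glob(os.path.join(download_dir, '*.pdf')))
```

Comparing the count against the number of links found on the page tells you immediately whether any downloads failed silently.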
  • I tried your code and yes, it does automatically start the downloads, but I get this "Failed - Download error" when it starts downloading the PDFs. – jar Jun 28 '21 at 05:38
  • change the downloadPath to the folder you want them in (absolute path) – pguardiario Jun 28 '21 at 06:33