How to download PDF from url in python

Question

Note: This is a very different problem from the one addressed in other SO answers (Selenium Webdriver: How to Download a PDF File with Python?) to similar questions.

This is because the URL https://webice.ongc.co.in/pay_adv?TRACKNO=8262# does not return the PDF directly; instead, it makes several other calls, and one of them is the URL that returns the PDF file.

I want to be able to call the URL with a variable for the query param TRACKNO and save the PDF file using Python.

I was able to do this using selenium, but my code fails to work when the browser is used in headless mode and I need it to work in headless mode. The code that I wrote is as follows:

import requests
from urllib3.exceptions import InsecureRequestWarning
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import time

def extract_url(driver):
    advice_requests = driver.execute_script("var performance = window.performance || window.mozPerformance || window.msPerformance || window.webkitPerformance || {}; var network = performance.getEntries() || {}; return network;")
    print(advice_requests)
    for request in advice_requests:
        if(request.get('initiatorType',"") == 'object' and request.get('entryType',"") == 'resource'):
            link_split = request['name'].split('-')
            if(link_split[-1] == 'filedownload=X'):
                print("..... Successful")
                return request['name']
    print("..... Failed")

def save_advice(advice_url,tracking_num):
    requests.packages.urllib3.disable_warnings(category=InsecureRequestWarning)
    response = requests.get(advice_url,verify=False)

    with open(f'{tracking_num}.pdf', 'wb') as f:
        f.write(response.content)

def get_payment_advice(tracking_nums):
    options = webdriver.ChromeOptions()
#   options.add_argument('headless')  # DOES NOT WORK IN HEADLESS MODE SO COMMENTED OUT
    driver = webdriver.Chrome(options=options)
    
    for num in tracking_nums:
        print(num,end=" ")
        driver.get(f'https://webice.ongc.co.in/pay_adv?TRACKNO={num}#')
        try:
            WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, 'ls-highlight-domref')))
            time.sleep(0.1)
            advice_url = extract_url(driver)
            save_advice(advice_url,num)
        except Exception as exc:
            # Report the failure instead of silently swallowing it
            print(f"..... Error: {exc}")
    driver.quit()

get_payment_advice(['8262'])

As can be seen, I get all the network calls that the browser makes in the first line of the extract_url function and then parse each request to find the correct one. However, this does not work in headless mode.

Is there any other way of doing this as this seems like a workaround? If not, can this be fixed to work in headless mode?

Possible duplicate you may want to look at: https://stackoverflow.com/questions/43149534/selenium-webdriver-how-to-download-a-pdf-file-with-python — Kayes Fahim, Nov 06 '21 at 17:54
Does this answer your question? [Selenium Webdriver: How to Download a PDF File with Python?](https://stackoverflow.com/questions/43149534/selenium-webdriver-how-to-download-a-pdf-file-with-python) — Kayes Fahim, Nov 06 '21 at 17:54
I tried this; it does not work because the URL does not return the PDF file directly. It internally makes more requests, one of which returns the PDF — dracarys, Nov 06 '21 at 18:16

score 1 · Accepted Answer · edited Aug 04 '22 at 11:17

I fixed it; I only changed one function. The correct URL is in the driver's page_source (with BeautifulSoup you can parse HTML, XML, etc.):

from bs4 import BeautifulSoup

def extract_url(driver):
    soup = BeautifulSoup(driver.page_source, "html.parser")
    object_element = soup.find("object")
    data = object_element.get("data")

    return f"https://webice.ongc.co.in{data}"

The hostname part can probably be extracted from the driver. I don't think I changed anything else, but if it does not work for you, I can paste the full code.
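As a rough sketch of deriving the hostname from the driver rather than hard-coding it: the value of the object element's data attribute can be resolved against the page's own URL (for example driver.current_url) with urllib.parse.urljoin. To keep the sketch free of third-party dependencies, it uses the standard-library HTMLParser in place of BeautifulSoup. The names ObjectDataFinder and extract_url_from_source are hypothetical helpers, not part of the answer above.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ObjectDataFinder(HTMLParser):
    """Remembers the data attribute of the first <object> tag it sees."""

    def __init__(self):
        super().__init__()
        self.data = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lowercase
        if tag == "object" and self.data is None:
            self.data = dict(attrs).get("data")


def extract_url_from_source(page_source, current_url):
    """Return the absolute URL of the first <object> in page_source, or None."""
    finder = ObjectDataFinder()
    finder.feed(page_source)
    if finder.data is None:
        return None
    # urljoin takes scheme and host from current_url when data is a path
    return urljoin(current_url, finder.data)
```

With Selenium this would be called as extract_url_from_source(driver.page_source, driver.current_url). It returns None when the page contains no object element, so the caller can skip that tracking number.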

Old Answer:

If you print the text of the returned page (print(driver.page_source)), I think you will get a message that says something like: "Because of your system configuration the pdf can't be loaded"

This is because the requested site checks some preferences to decide whether you are a robot. Changing some arguments (screen size, user agent) may help fix this. Here is some information about how to detect a headless browser.
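A minimal sketch of what changing those arguments could look like, assuming Chrome's --headless, --window-size and --user-agent command-line switches. Whether this is enough to get past this particular site's check is not known. The helper name headless_arguments and the user-agent placeholder are hypothetical.

```python
def headless_arguments(width=1920, height=1080, user_agent=None):
    """Build Chrome switches for a headless run with an explicit window size."""
    args = ["--headless", f"--window-size={width},{height}"]
    if user_agent:
        # Override the default headless user agent with the given string
        args.append(f"--user-agent={user_agent}")
    return args


# Usage with Selenium (needs selenium and a Chrome driver installed):
# options = webdriver.ChromeOptions()
# for arg in headless_arguments(user_agent="MY_USER_AGENT_STRING"):
#     options.add_argument(arg)
# driver = webdriver.Chrome(options=options)
```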

And next time, you should paste all relevant code into the question (including imports) to make it easier to test.

I have added the imports. According to what you said, I can see the message ```Due to browser or system restrictions, this PDF document cannot be displayed (For example, check PDF viewer installation)```. I tried fixing it but was unable to do so. Can you give some more insight? Also any other alternate way to do this would also be appreciated. — dracarys, Nov 06 '21 at 17:48
Maybe have a look here: https://stackoverflow.com/questions/45631715/downloading-with-chrome-headless-and-selenium — D-E-N, Nov 06 '21 at 18:07
I tried this, it does not seem to work. I tried different solutions from several answers from the above link but none worked — dracarys, Nov 06 '21 at 18:20
