Note: This is very different problem compared to other SO answers (Selenium Webdriver: How to Download a PDF File with Python?) available for similar questions.
This is because The URL: https://webice.ongc.co.in/pay_adv?TRACKNO=8262#
does not directly return the pdf but in turn makes several other calls and one of them is the url that returns the pdf file.
I want to be able to call the url with a variable for the query param TRACKNO
and to be able to save the pdf file using python.
I was able to do this using selenium, but my code fails to work when the browser is used in headless mode and I need it to work in headless mode. The code that I wrote is as follows:
import requests
from urllib3.exceptions import InsecureRequestWarning
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import time
def extract_url(driver):
advice_requests = driver.execute_script("var performance = window.performance || window.mozPerformance || window.msPerformance || window.webkitPerformance || {}; var network = performance.getEntries() || {}; return network;")
print(advice_requests)
for request in advice_requests:
if(request.get('initiatorType',"") == 'object' and request.get('entryType',"") == 'resource'):
link_split = request['name'].split('-')
if(link_split[-1] == 'filedownload=X'):
print("..... Successful")
return request['name']
print("..... Failed")
def save_advice(advice_url,tracking_num):
requests.packages.urllib3.disable_warnings(category=InsecureRequestWarning)
response = requests.get(advice_url,verify=False)
with open(f'{tracking_num}.pdf', 'wb') as f:
f.write(response.content)
def get_payment_advice(tracking_nums):
options = webdriver.ChromeOptions()
# options.add_argument('headless') # DOES NOT WORK IN HEADLESS MODE SO COMMENTED OUT
driver = webdriver.Chrome(options=options)
for num in tracking_nums:
print(num,end=" ")
driver.get(f'https://webice.ongc.co.in/pay_adv?TRACKNO={num}#')
try:
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, 'ls-highlight-domref')))
time.sleep(0.1)
advice_url = extract_url(driver)
save_advice(advice_url,num)
except:
pass
driver.quit()
get_payment_advice['8262']
As it can be seen I get all the network calls that the browser makes in the first line of the extract_url
function and then parse each request to find the correct one. However this does not work in headless mode
Is there any other way of doing this as this seems like a workaround? If not, can this be fixed to work in headless mode?