I want to download a webpage using Python for a web scraping task. The problem is that the website requires cookies to be enabled; otherwise it serves a different version of the page. I implemented a solution that works, but I think it is inefficient. I need your help to improve it!
This is how I handle it now:
import requests
import cookielib

url = 'https://ccirecruit.cox.com/psc/RECRUIT/EMPLOYEE/HRMS/c/HRS_HRAM.HRS_CE.GBL?JobOpeningId=42845&SiteId=1&Page=HRS_CE_JOB_DTL&PostingSeq=1&'
cj = cookielib.CookieJar()
user_agent = {'User-agent': 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'}

# First request: only there to get the cookies
requests.get(url, headers=user_agent, timeout=2, cookies=cj)
# Second request: reuses the cookies served the first time
r = requests.get(url, headers=user_agent, timeout=2, cookies=cj)
html_text = r.text
Basically, I create a CookieJar object and then send two consecutive requests for the same URL. The first time, the server gives me the bad page but sets cookies in return. The second request reuses those cookies and I get the right page.
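For reference, the same two-request flow can be sketched with a requests.Session, which keeps its own cookie jar so nothing has to be passed around by hand. This is only a minimal sketch: `fetch_with_cookies` is a hypothetical helper name, and it still sends two requests, so it tidies the cookie handling rather than answering the one-request question.

```python
import requests


# Hypothetical helper for illustration; not part of the requests library.
def fetch_with_cookies(url, headers=None, timeout=2, session=None):
    """Request `url` twice through one Session and return the second body.

    The Session stores cookies set by the first response and sends them
    automatically with the second request.
    """
    if session is None:
        session = requests.Session()
    # First request: its only purpose is to let the server set cookies.
    session.get(url, headers=headers, timeout=timeout)
    # Second request: the session attaches the stored cookies itself.
    response = session.get(url, headers=headers, timeout=timeout)
    return response.text
```

Calling `fetch_with_cookies(url, headers=user_agent)` with the URL and headers from the snippet above should return the same `html_text`, assuming the site only needs the cookies from the first response.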
The question is: is it possible to use just one request and still get the right, cookie-enabled version of the page?
I tried sending a HEAD request the first time instead of a GET to minimize traffic, but in that case no cookies are served. Googling for it didn't give me an answer either.
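The HEAD versus GET difference can be checked with something like the sketch below. `cookies_set_by` is a hypothetical helper name, and redirects are turned off on the assumption that only the cookies from the first response matter.

```python
import requests


# Hypothetical helper for illustration; not part of the requests library.
def cookies_set_by(method, url, headers=None, timeout=2):
    """Return the cookies set by one `method` request to `url` as a dict."""
    with requests.Session() as session:
        response = session.request(
            method,
            url,
            headers=headers,
            timeout=timeout,
            allow_redirects=False,  # look at this response only
        )
        # response.cookies holds the cookies the server set in this response.
        return response.cookies.get_dict()
```

Comparing `cookies_set_by('HEAD', url)` with `cookies_set_by('GET', url)` shows whether the server sets cookies for each method; for this site the HEAD result comes back empty.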
So I'd like to understand how to do this efficiently. Any ideas?