
I want to download a webpage using Python for a web scraping task. The problem is that the website requires cookies to be enabled; otherwise it serves a different version of the page. I did implement a solution that solves the problem, but in my opinion it is inefficient. I need your help to improve it!

This is how I do it now:

import requests
import cookielib  # http.cookiejar in Python 3

cj = cookielib.CookieJar()
user_agent = {'User-agent': 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'}
url = ('https://ccirecruit.cox.com/psc/RECRUIT/EMPLOYEE/HRMS/c/HRS_HRAM.HRS_CE.GBL'
       '?JobOpeningId=42845&SiteId=1&Page=HRS_CE_JOB_DTL&PostingSeq=1&')

# First request: serves the wrong page, but fills the cookie jar.
requests.get(url, headers=user_agent, timeout=2, cookies=cj)
# Second request: reuses the cookies and gets the right page.
r = requests.get(url, headers=user_agent, timeout=2, cookies=cj)
html_text = r.text

Basically, I create a CookieJar object and then send two consecutive requests for the same URL. The first time it serves me the wrong page, but sets the cookies in return. The second request reuses these cookies and I get the right page.

The question is: is it possible to use just one request and still get the right, cookie-enabled version of the page?

I tried sending a HEAD request the first time instead of GET to minimize traffic, but in that case the cookies aren't served. Googling didn't give me the answer either. So, it would be interesting to understand how to do this efficiently! Any ideas?!
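For reference, the HEAD attempt looked roughly like this, reusing the variables from the snippet above (the response simply comes back without cookies):

# Tried this instead of the first GET, but no cookies are served:
requests.head(url, headers=user_agent, timeout=2, cookies=cj)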

Nik

2 Answers


You need to make the request to get the cookie, so no, you cannot obtain the cookie and reuse it without making two separate requests. If by "cookie-enabled" you mean the version that recognizes your script as having cookies, then it all depends on the server and you could try:

  • hardcoding the cookies before making the first request (see the sketch after this list),
  • requesting the smallest possible page (the smallest possible response that still sets the cookies) to obtain the first cookie,
  • trying to find some workaround (maybe adding some GET argument will fool the site into believing you have cookies - but you would need to find it for this specific site).
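To illustrate the first option, here is a minimal sketch; the cookie name 'PS_TOKEN' is purely hypothetical, and you would copy the real name and value from a browser session. It only helps if the server accepts a previously captured value:

import requests

user_agent = {'User-agent': 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'}
url = 'https://ccirecruit.cox.com/psc/RECRUIT/EMPLOYEE/HRMS/c/HRS_HRAM.HRS_CE.GBL?JobOpeningId=42845&SiteId=1&Page=HRS_CE_JOB_DTL&PostingSeq=1&'

# Hypothetical cookie - inspect a real browser session for the actual name/value.
hardcoded_cookies = {'PS_TOKEN': 'value-copied-from-a-browser-session'}
# One request only: the cookies are supplied up front instead of fetched first.
r = requests.get(url, headers=user_agent, timeout=2, cookies=hardcoded_cookies)
html_text = r.text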
Tadeck
  • Thanks @Tadeck! I actually don't know the pages in advance and cannot predict their behavior (with or without cookies). So in this case, taking your comment into account, I think 2 requests are required. BTW, by cookie-enabled I mean that in order to serve the right page their server asks for cookies. When I load the page from the example in a browser, the server seems to exchange several messages with me before I see the right page. – Nik Nov 19 '12 at 02:25
  • Also, maybe there is a way to at least avoid making these 2 sequential requests for every page in my DB?! Some pages are served correctly from the beginning, but sometimes I encounter this problem. Is there a way to judge from the first request whether the page is **surrogate** or not? I guess not, but what do you think?! – Nik Nov 19 '12 at 02:29
  • @Nik: It looks like they do not want the page to be scraped, and thus do not make it easily identifiable. I think there is no universal way of identifying such cases across several different sites. In this specific case you can try to identify differences - e.g. the first response has the "`respondwithsignonpage`" header set to "`true`", which you could use for checks (see the sketch after these comments). However, this is a non-standard HTTP header and you will most likely not find it on other sites. – Tadeck Nov 19 '12 at 04:34
  • Thank you, @Tadeck! I agree with you. I am already comparing differences between the files served, just for fun, to see what the percentage of such cases is. I don't think they are abundant. – Nik Nov 19 '12 at 05:01
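Building on the respondwithsignonpage observation above, a minimal sketch of such a check could look like this (the header name is specific to this site; other sites would need their own detection rule):

import requests

user_agent = {'User-agent': 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'}
url = 'https://ccirecruit.cox.com/psc/RECRUIT/EMPLOYEE/HRMS/c/HRS_HRAM.HRS_CE.GBL?JobOpeningId=42845&SiteId=1&Page=HRS_CE_JOB_DTL&PostingSeq=1&'

r = requests.get(url, headers=user_agent, timeout=2)
# Non-standard header this particular server sets when serving the sign-on page.
if r.headers.get('respondwithsignonpage') == 'true':
    # Only now pay for a second request, sending back the cookies we just got.
    r = requests.get(url, headers=user_agent, timeout=2, cookies=r.cookies)
html_text = r.text

This way most pages in the DB cost one request, and only the cookie-guarded ones cost two.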

I think the winner here might be to use the session framework in requests, which takes care of the cookies for you.

That would look something like this:

import requests

user_agent = {'User-agent': 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'}
url = ('https://ccirecruit.cox.com/psc/RECRUIT/EMPLOYEE/HRMS/c/HRS_HRAM.HRS_CE.GBL'
       '?JobOpeningId=42845&SiteId=1&Page=HRS_CE_JOB_DTL&PostingSeq=1&')

s = requests.Session()        # the session keeps cookies across requests
s.headers.update(user_agent)  # default headers for every request on this session

r = s.get(url, timeout=2)
html_text = r.text

Try that and see if it works?

jdotjdot
  • No, @jdotjdot, it didn't work. The reason is that the session also needs a first interaction to receive the cookies. Two requests are still needed in this case. Thanks for the effort though! – Nik Nov 19 '12 at 03:10
  • Yeah, I even tried again using `s.head(...)`, and that didn't work either. Kind of an odd issue. – jdotjdot Nov 19 '12 at 03:23
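For the record, that HEAD attempt would have looked something like this, reusing s and url from the answer above; on this server the HEAD response evidently does not fill the session's cookie jar:

s.head(url, timeout=2)     # hoped this would collect the cookies cheaply
r = s.get(url, timeout=2)  # but the cookie-enabled page still is not returned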