
I am trying to scrape some data from http://www.pogdesign.co.uk/cat/.

I want to get the channel and the air-time of each program, but by default they are not shown. They appear only after I manually configure the settings and save them.

From inspecting the 'Network' tab in Chrome's developer tools, this is what seems to happen when I click 'Save Settings': a POST request is sent with the relevant data parameters (e.g. 's_networks':'on'), and then a GET request is sent to retrieve the HTML page with the channel and the air-time displayed.

I tried to emulate this process (a POST request followed by a GET request) using both Python's requests package and the mechanicalsoup package.

requests:

import requests

s = requests.Session()
# POST the setting, then GET the page with the same session
s.post('http://www.pogdesign.co.uk/cat/', data={'s_networks': 'on'})
s.get('http://www.pogdesign.co.uk/cat/')

mechanicalsoup:

import mechanicalsoup

mcs = mechanicalsoup.Browser()
res_post = mcs.post('http://www.pogdesign.co.uk/cat/', data={'s_networks': 'on'})
res_get = mcs.get('http://www.pogdesign.co.uk/cat/')

Yet the response I receive does not contain the channel and the air-time data.

The only difference I noticed is that the browser's POST request returns status code 302, whereas my Python requests return 200.
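One possible reading of the 302 versus 200 difference: requests follows redirects by default, so the status code it reports belongs to the final response, and any intermediate 302 is kept in `response.history`. The sketch below is only a way to check that. The helper names `status_chain` and `post_then_get` are made up for illustration, and `session` is assumed to behave like a `requests.Session`.

```python
def status_chain(response):
    """Return the status code of every hop: redirects first, final response last."""
    return [hop.status_code for hop in response.history] + [response.status_code]


def post_then_get(session, url, form_data):
    """POST the settings form, then GET the page again with the same session.

    `session` is assumed to behave like requests.Session (hypothetical helper).
    """
    post_resp = session.post(url, data=form_data)
    get_resp = session.get(url)
    return status_chain(post_resp), get_resp.text
```

Calling `post_then_get(requests.Session(), 'http://www.pogdesign.co.uk/cat/', {'s_networks': 'on'})` returns the chain of status codes for the POST together with the HTML of the follow-up GET. If the server answers the POST with a redirect, the chain starts with 302 instead of containing only 200.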

Cœur
yaakovk

1 Answer


This is because the user info is stored in a cookie. You can try the following code:

import requests

s = requests.Session()
data = {
    "style": 3,
    "timezone": "GMT",
    "s_numbers": "on",
    "s_epnames": "on",
    "s_airtimes": "on",
    "s_popups": "on",
    "s_wunwatched": "on",
    "s_sortbyname": "on",
    "s_weekstyle": "on",
    "s_24hr": "on",
    "settings": None,  # note: requests leaves None-valued fields out of the form body
}
cookies = {  # copy these values from the browser's dev tools
    "CAT_UID": "",
    "PHPSESSID": "",
    "_ga": "",
    "_gid": "",
    "_gat": "",
}
post = s.post('http://www.pogdesign.co.uk/cat/', data=data, cookies=cookies)
text = post.text
get = s.get('http://www.pogdesign.co.uk/cat/', cookies=cookies)
text1 = get.text
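Filling in the cookie values by hand can be tedious. As a small sketch, the `Cookie` request header can be copied as one string from the dev tools and split into a dict with the standard library. The helper name `cookies_from_header` is hypothetical, and the sketch assumes the header is a plain `name=value; name2=value2` string.

```python
from http.cookies import SimpleCookie


def cookies_from_header(header_value):
    """Turn a 'name=value; name2=value2' Cookie header string into a dict."""
    jar = SimpleCookie()
    jar.load(header_value)  # parse the raw header string
    return {name: morsel.value for name, morsel in jar.items()}
```

The resulting dict can be passed as `cookies=` to `s.post(...)` and `s.get(...)` in place of a hand-written one.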
aristotll