I'm trying to crawl a website using the requests library. However, the particular website I am trying to access (http://www.vi.nl/matchcenter/vandaag.shtml) has a very intrusive cookie statement.

I am trying to access the website as follows:

from bs4 import BeautifulSoup as soup
import requests

website = "http://www.vi.nl/matchcenter/vandaag.shtml"
html = requests.get(website, headers={"User-Agent": "Mozilla/5.0"})
htmlsoup = soup(html.text, "html.parser")

This returns a web page that consists of just the cookie statement with a big button to accept. If you try accessing this page in a browser, you find that pressing the button redirects you to the requested page. How can I do this using requests?

I considered using mechanize.Browser but that seems a pretty roundabout way of doing it.

user2229219

2 Answers


Try setting:

import requests

website = "http://www.vi.nl/matchcenter/vandaag.shtml"
cookies = dict(BCPermissionLevel='PERSONAL')
html = requests.get(website, headers={"User-Agent": "Mozilla/5.0"}, cookies=cookies)

This will bypass the cookie consent page and land you straight on the requested page.

Note: You can find the above by analyzing the JavaScript that runs on the cookie consent page; it is a bit obfuscated, but it should not be difficult. If you run into the same kind of problem again, look at what cookies the JavaScript executed by the button's event handler sets.
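If you want to check what a site sets server-side before digging through its JavaScript, here is a minimal sketch using a requests.Session. Note this only surfaces cookies sent in Set-Cookie headers; cookies set purely in JavaScript, as on consent pages like this one, will not show up here and you have to read the script itself:

import requests

session = requests.Session()
response = session.get("http://www.vi.nl/matchcenter/vandaag.shtml",
                       headers={"User-Agent": "Mozilla/5.0"})

# Cookies set via Set-Cookie headers on this particular response
print(response.cookies.get_dict())
# Everything accumulated in the session's cookie jar so far
print(session.cookies.get_dict())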

George Daramouskas

I found this SO question, which asks how to send cookies in a POST using requests. The accepted answer there states that recent builds of Requests will build CookieJars for you from simple dictionaries. Below is the proof-of-concept code included in that answer.

import requests

# requests converts this plain dict into a CookieJar for you
cookie = {'enwiki_session': '17ab96bd8ffbe8ca58a78657a918558'}

r = requests.post('http://wikipedia.org', cookies=cookie)
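If you are crawling more than one page on the same site, a requests.Session keeps cookies between requests, so the consent cookie only needs to be set once. A minimal sketch, assuming the BCPermissionLevel cookie from the other answer is the one the consent button sets:

import requests

session = requests.Session()
# Set the consent cookie once; the session sends it with every request
session.cookies.set("BCPermissionLevel", "PERSONAL")

response = session.get("http://www.vi.nl/matchcenter/vandaag.shtml",
                       headers={"User-Agent": "Mozilla/5.0"})
print(response.status_code)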
Koga