How to bypass Terms and Conditions agreement with Beautiful Soup

Question

I want to scrape this website: https://cage.dla.mil/Home/UsageAgree using Beautiful Soup. What I'm doing:

import requests
from bs4 import BeautifulSoup

url = "https://cage.dla.mil/Home/UsageAgree"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
print(soup)

which returns the HTML of a cookie agreement page. I want to get past this page and scrape the content of the actual page once the cookies have been accepted.

I followed this post: Scraping a webpage using Python (beautiful soup) that requires "I agree to cookies" button being clicked?

and did:

import requests
from bs4 import BeautifulSoup

url = 'https://cage.dla.mil/'
s = requests.Session()
s.cookies.update({'agree': 'True'})
r = s.get(url)  # keep the session's response instead of reusing the earlier r
soup = BeautifulSoup(r.content, "html.parser")
print(soup)

but I'm still getting the agreement page.

It seems that one of the cookies gets a unique value every time, and I'm not sure how to deal with this.

https://stackoverflow.com/questions/57171353/scraping-a-webpage-using-python-beautiful-soup-that-requires-i-agree-to-cooki — Victor Loke Chapelle Hansen, Jun 09 '22 at 14:57
Does this answer your question? [Scraping a webpage using Python (beautiful soup) that requires "I agree to cookies" button being clicked?](https://stackoverflow.com/questions/57171353/scraping-a-webpage-using-python-beautiful-soup-that-requires-i-agree-to-cooki) — baduker, Jun 09 '22 at 15:01

score 1 · Accepted Answer · answered Jun 09 '22 at 16:48

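This should work. The agreement form contains a __RequestVerificationToken input, so fetch the agreement page, read that token, and POST it back in the same session before requesting the actual page.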

import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:101.0) Gecko/20100101 Firefox/101.0"
}

with requests.Session() as s:
    token = (
        BeautifulSoup(
            s.get(
                "https://cage.dla.mil/Home/UsageAgree",
                headers=headers,
            ).text,
            "lxml",
        ).select_one("form input")["value"]
    )
    payload = {
        "__RequestVerificationToken": token,
        "returningURL": "",
    }
    _ = s.post(
        "https://cage.dla.mil/Home/UsageAgree",
        data=payload,
        headers=headers
    )
    soup = (
        BeautifulSoup(
            s.get("https://cage.dla.mil/", headers=headers).text,
            "lxml",
        ).select("#briefnewslist > div > p > em")
    )
    print("\n".join(p.getText(strip=True) for p in soup))

Output:

Scheduled Maintenance
SAM Validation: Unable To Find A Matching Entity When Asked To Enter Or Validate My Entity Information
SAM Validation: Continue A Registration Update Or Renewal If Validation Fails
SAM.gov Registration for Financial Assistance
Financial Assistance Update
CAGE Expiration Date
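
A slightly more defensive variant of the token step, as a sketch only. The code above takes the first input in the form; the version below looks the field up by name instead and fails loudly if it is missing. The names extract_token, accept_agreement and SAMPLE_FORM are hypothetical, and the markup in SAMPLE_FORM is an assumed stand-in for the real page, not copied from it. The session argument is expected to be a requests.Session.

```python
from bs4 import BeautifulSoup

# Field name taken from the payload in the answer above.
TOKEN_FIELD = "__RequestVerificationToken"


def extract_token(html: str) -> str:
    """Return the value of the token input, looked up by its name attribute."""
    soup = BeautifulSoup(html, "html.parser")
    field = soup.find("input", attrs={"name": TOKEN_FIELD})
    if field is None or not field.get("value"):
        raise ValueError(f"no {TOKEN_FIELD} input found in the page")
    return field["value"]


def accept_agreement(session, agree_url: str, headers: dict) -> None:
    """GET the agreement page, then POST its token back on the same session.

    `session` is assumed to be a requests.Session, so the cookies set by the
    GET are sent along with the POST.
    """
    page = session.get(agree_url, headers=headers)
    page.raise_for_status()
    payload = {TOKEN_FIELD: extract_token(page.text), "returningURL": ""}
    session.post(agree_url, data=payload, headers=headers).raise_for_status()


# Offline check against an assumed (hypothetical) snippet of the form.
SAMPLE_FORM = (
    '<form action="/Home/UsageAgree" method="post">'
    '<input name="returningURL" type="hidden" value="">'
    '<input name="__RequestVerificationToken" type="hidden" value="abc123">'
    "</form>"
)
print(extract_token(SAMPLE_FORM))  # prints abc123
```

With a live session you would call accept_agreement(s, "https://cage.dla.mil/Home/UsageAgree", headers) once, then use s.get(...) for the pages you actually want. Looking the input up by name keeps the code working if the form ever gains another input ahead of the token.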
