
I'm trying to scrape a website using requests. Unfortunately, it takes so long that I need to refresh the cookie from time to time during scraping (I also don't want to have to manually copy a new cookie when I re-run the script in a couple days). I think the easiest way to do this is to load up Chrome with playwright and open the website and then return the cookie that ends up getting set.

I'm not sure how to do this, because getting the cookie is asynchronous: I can only read it inside an event listener registered with page.on('request', fn), and then I somehow need to return the cookie once I have it. I tried setting a variable in the outer scope:

from playwright.sync_api import sync_playwright

cookie = None
with sync_playwright() as p:
    browser = p.firefox.launch()
    page = browser.new_page()

    def set_cookie(req):
        if "cookie" in req.headers:
            print(req.headers["cookie"])
            cookie = req.headers["cookie"]
    page.on('request', set_cookie)
    page.goto(url)

    browser.close()
print(cookie)

This prints the correct cookie, but it ultimately doesn't work because the cookie assigned inside the inner function (set_cookie()) is a new local variable, not the cookie in the outer scope (my IDE even complains that I never use the variable I set on line 11), so the outer cookie variable stays None.

I also tried adding a context and getting the cookie through context.cookies():

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    context = browser.new_context()
    page.goto(url)
    print(context.cookies())  # returns []
    browser.close()

But that just returns an empty list (I tried both p.chromium and p.firefox).

I could put the first snippet into a separate file and call it with subprocess, but surely there's a better way.

Boris Verkhovskiy

1 Answer


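You can assign to a variable defined in an enclosing scope by declaring it nonlocal inside the inner function. nonlocal only applies to variables of an enclosing function (a module-level variable would need global instead), so the code below is wrapped in a function: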

from playwright.sync_api import sync_playwright

def new_cookie(url):
    cookie = None
    def set_cookie(req):
        nonlocal cookie
        if "cookie" in req.headers:
            cookie = req.headers["cookie"]

    with sync_playwright() as p:
        browser = p.firefox.launch()
        page = browser.new_page()
        page.on("request", set_cookie)
        page.goto(url)
        browser.close()

    return cookie

print(new_cookie("https://example.com"))
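
The second attempt in the question returns [] for a different reason: browser.new_page() creates its own context, so the context created separately with browser.new_context() never loads the page. As an alternative sketch, assuming the goal is to pass the cookies on to requests, the page can be opened from the context and the cookies read from that same context afterwards. The helper names cookie_header and fetch_cookies are made up for illustration, and cookies that the site sets from JavaScript after the initial load may need an extra wait that is not shown here.

```python
def cookie_header(cookies):
    """Join Playwright-style cookie dicts into a single Cookie header value."""
    return "; ".join(f"{c['name']}={c['value']}" for c in cookies)


def fetch_cookies(url):
    """Open url in a fresh browser context and return the cookies it ends up with."""
    # Imported lazily so cookie_header() can be used without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.firefox.launch()
        context = browser.new_context()
        page = context.new_page()  # the page belongs to this context
        page.goto(url)
        cookies = context.cookies()  # list of dicts with "name", "value", ...
        browser.close()
    return cookies
```

With requests, the result could then be sent as requests.get(url, headers={"Cookie": cookie_header(fetch_cookies(url))}). Note that context.cookies() returns cookies for every domain the context visited unless URLs are passed to it, so it may include more than the request-listener approach captures.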
Boris Verkhovskiy