When you open this link, the page runs some JavaScript and then automatically redirects to a PDF. I can't get that final URL from Playwright; the code below does not give me the PDF's address.

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://scnv.io/760y", wait_until="networkidle")
    print(page.url)
    page.close()

Is there a way to get that final URL?

chhenning

1 Answer

There are multiple ways to do it. One way is using page.expect_response:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    page = browser.new_page()
    
    # Wait for the first response whose URL ends in '.pdf'
    with page.expect_response('**/*.pdf') as response_info:
        page.goto("https://scnv.io/760y")

    # response_info.value is the matched Response object
    print(response_info.value.url)
    browser.close()

Output

https://qcg-media.s3.amazonaws.com/media/uploads/72778/2022/06/20220622_663043_221.pdf

See the Network section of the Playwright documentation, which covers handling network traffic in detail.

Also note that I deliberately left out wait_until='networkidle', because it does not fit this use case. That event fires only after the network has been idle for at least 500 ms, which does not happen on this site while it is requesting the PDF. If you include it, the code will be inconsistent at best at catching the response whose URL we want.
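As a variation on the expect_response approach above, here is a minimal sketch that passes a predicate instead of a glob, so a URL with a query string or an upper-case extension would still match. The helper names looks_like_pdf and final_pdf_url and the 30-second timeout are my own assumptions for illustration; headless=False simply mirrors the code above.

```python
from urllib.parse import urlparse


def looks_like_pdf(url: str) -> bool:
    """Return True when the URL's path ends in .pdf (case-insensitive, query ignored)."""
    return urlparse(url).path.lower().endswith(".pdf")


def final_pdf_url(start_url: str, timeout_ms: float = 30_000) -> str:
    """Open start_url and return the URL of the first PDF-looking response."""
    # Imported here so looks_like_pdf can be used without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        page = browser.new_page()

        # expect_response accepts a callable that receives each Response;
        # the block exits once one matches or raises when the timeout expires.
        with page.expect_response(
            lambda response: looks_like_pdf(response.url),
            timeout=timeout_ms,
        ) as response_info:
            page.goto(start_url)

        url = response_info.value.url
        browser.close()
        return url
```

Calling final_pdf_url("https://scnv.io/760y") should then give the same kind of result as the glob version, assuming the site still redirects to a URL whose path ends in .pdf.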

Charchit Agarwal