
I want to connect to a website and download some PDF files. The website shows its content only after login. It requires logging in with an OTP and does not allow being logged in on more than 3 devices simultaneously.

I want to download all the PDFs listed. I previously tried

python -m playwright open --save-storage websitename.json

to save the login, but it does not work for that specific website: the websitename.json file was empty, whereas the same command worked for other websites.
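To confirm what the saved file actually contains, a small check like the following may help. It is only a sketch, assuming the file follows Playwright's storage-state layout (a top-level "cookies" list and an "origins" list holding localStorage entries); the function name summarize_storage_state is made up for illustration.

```python
import json
from pathlib import Path


def summarize_storage_state(path):
    """Return (cookie_count, origin_count) for a Playwright storage-state file."""
    state = json.loads(Path(path).read_text(encoding="utf-8"))
    cookies = state.get("cookies", [])
    origins = state.get("origins", [])
    return len(cookies), len(origins)


if __name__ == "__main__":
    # "websitename.json" is the file name used in the command above.
    path = Path("websitename.json")
    if path.exists():
        n_cookies, n_origins = summarize_storage_state(path)
        print(f"{n_cookies} cookies, {n_origins} origins with localStorage")
```

If both counts are zero, one possible explanation is that the site keeps its login token somewhere the storage state does not capture, such as sessionStorage.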

Therefore, the only solution I can think of now is to connect to the browser that is already running, open that website, and then download those PDFs.

If you have a solution for this, or some other approach, please let me know.

I was also thinking about switching to Puppeteer for this. However, I don't know how to parse HTML in Node.js, and I am more comfortable using CSS selectors, so I can't switch.

2 Answers


To connect to an already running browser (Chrome) session, you can use the connect_over_cdp method (added in v1.9 of Playwright).

For this, you need to start Chrome in debug mode. Create a desktop shortcut for Chrome and edit the Target field of the shortcut properties, adding --remote-debugging-port=9222 so that the target becomes: "C:\Program Files\Google\Chrome\Application\chrome.exe" --remote-debugging-port=9222

Now start Chrome and check whether it is in debug mode: open a new tab and paste this URL in the address bar: http://localhost:9222/json/version. If you are in debug mode, you should see a page with a JSON response; otherwise, in "normal" mode, it will say "Page not found" or something similar.
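The same check can be done from Python before attempting to connect. This is a minimal sketch using only the standard library; the helper name debugger_info and the 2-second timeout are assumptions, while the URL is the one given above.

```python
import json
import urllib.request


def debugger_info(base_url="http://localhost:9222", timeout=2.0):
    """Return the parsed /json/version response, or None if nothing answers."""
    try:
        with urllib.request.urlopen(f"{base_url}/json/version", timeout=timeout) as resp:
            return json.loads(resp.read().decode("utf-8"))
    except (OSError, ValueError):
        # OSError covers refused connections and timeouts (URLError is a subclass);
        # ValueError covers a reply that is not valid JSON.
        return None


if __name__ == "__main__":
    info = debugger_info()
    if info is None:
        print("Chrome does not appear to be listening on port 9222")
    else:
        print("Debug mode:", info.get("Browser"), info.get("webSocketDebuggerUrl"))
```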

Now, in your Python script, write the following code to connect to the Chrome instance:

browser = playwright.chromium.connect_over_cdp("http://localhost:9222")
default_context = browser.contexts[0]
page = default_context.pages[0]

Here is the full script code:

# Import the sync_playwright function from the sync_api module of Playwright.
from playwright.sync_api import sync_playwright

# Start a new session with Playwright using the sync_playwright function.
with sync_playwright() as playwright:
    # Connect to an existing instance of Chrome using the connect_over_cdp method.
    browser = playwright.chromium.connect_over_cdp("http://localhost:9222")

    # Retrieve the first context of the browser.
    default_context = browser.contexts[0]

    # Retrieve the first page in the context.
    page = default_context.pages[0]

    # Print the title of the page (title() is a method in the Python API).
    print(page.title())

    # Print the URL of the page.
    print(page.url)
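Building on the script above, the PDF download itself might look roughly like this. It is a sketch under several assumptions that are not in the original: the logged-in listing is the first tab, the PDFs are plain links whose href ends in ".pdf", and the output folder "pdfs" and the helper names pdf_filename and download_pdfs are made up for illustration.

```python
from pathlib import Path, PurePosixPath
from urllib.parse import unquote, urlparse


def pdf_filename(url):
    """Derive a local file name from a PDF URL (falls back to 'download.pdf')."""
    name = unquote(PurePosixPath(urlparse(url).path).name)
    return name if name.lower().endswith(".pdf") else "download.pdf"


def download_pdfs(cdp_url="http://localhost:9222", out_dir="pdfs"):
    # Imported here so pdf_filename() can be used without Playwright installed.
    from playwright.sync_api import sync_playwright

    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    with sync_playwright() as playwright:
        browser = playwright.chromium.connect_over_cdp(cdp_url)
        context = browser.contexts[0]
        page = context.pages[0]  # assumes the logged-in listing is the first tab

        # Collect the absolute URL of every link whose href ends in ".pdf".
        links = page.eval_on_selector_all(
            "a[href$='.pdf']", "els => els.map(e => e.href)"
        )
        for url in links:
            # context.request sends the cookies of the logged-in browser context.
            response = context.request.get(url)
            if response.ok:
                (target / pdf_filename(url)).write_bytes(response.body())
```

With Chrome running in debug mode and the listing page open, calling download_pdfs() should save each linked file into the pdfs folder. Files that share a name overwrite each other, so a real script may want to make the names unique.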
ePandit

Playwright is basically the same as Puppeteer, so switching between the two shouldn't be a problem.
You can use puppeteer-core or Playwright to control your existing browser installation, for example Chrome, and point it at the existing user data (profile) folder so that the website's login info (cookies, web storage, etc.) is loaded.

const puppeteer = require('puppeteer-core')

const launchOptions = {
    headless: false,
    executablePath: '/Applications/Google Chrome.app/Contents/MacOS/Google Chrome', // For MacOS
    // executablePath: 'C:\\Program Files (x86)\\Google\\Chrome\\Application\\chrome.exe', // For Windows
    // executablePath: '/usr/bin/google-chrome'  // For Linux
    args: [
        '--user-data-dir=/Users/username/Library/Application Support/Google/Chrome/', // For MacOS
        // '--user-data-dir=%userprofile%\\AppData\\Local\\Google\\Chrome\\User Data', // For Windows
        // '--profile-directory=Profile 1' // This to select default or specified Profile
    ]
}

// await is only valid inside an async function in a CommonJS script.
async function main() {
    const browser = await puppeteer.launch(launchOptions)
}

main()

For more details about Playwright's method, you can check this workaround: https://github.com/microsoft/playwright/issues/1985
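For Python Playwright, the same profile-reuse idea can be sketched with launch_persistent_context. This is an illustration rather than the method from the linked issue: the helper names chrome_user_data_dir and open_with_profile are hypothetical, and the directories are Chrome's usual defaults, which may differ on a given machine.

```python
from pathlib import Path


def chrome_user_data_dir(system, home):
    """Return Chrome's usual user-data directory for a platform.system() value."""
    home = Path(home)
    if system == "Darwin":
        return home / "Library" / "Application Support" / "Google" / "Chrome"
    if system == "Windows":
        return home / "AppData" / "Local" / "Google" / "Chrome" / "User Data"
    return home / ".config" / "google-chrome"  # usual Linux location


def open_with_profile(url, user_data_dir):
    # Imported here so the path helper works without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as playwright:
        # channel="chrome" asks Playwright to drive the installed Google Chrome.
        context = playwright.chromium.launch_persistent_context(
            str(user_data_dir), channel="chrome", headless=False
        )
        page = context.pages[0] if context.pages else context.new_page()
        page.goto(url)
        print(page.title())
        context.close()
```

A call such as open_with_profile(url, chrome_user_data_dir(platform.system(), Path.home())) would then reuse the existing login. Chrome must not already be running with the same profile, and recent Chrome releases may refuse automation of the default profile directory; in that case a dedicated directory with a one-time manual login is an alternative.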

Edi Imanto