13

I log in to a site and it gives me a browser cookie.

I then go to a URL that returns a JSON response.

How do I scrape the page after calling `await page.goto('blahblahblah.json');`?

Amy Coin

2 Answers

31

Another approach, which avoids intermittent failures, is to wait until the body is available, then parse its text as JSON inside the page and return the result. For example:

const puppeteer = require('puppeteer'); 

async function run() {

    const browser = await puppeteer.launch( {
        headless: false  //change to true in prod!
    }); 

    const page = await browser.newPage(); 

    await page.goto('https://raw.githubusercontent.com/GoogleChrome/puppeteer/master/package.json');

    // I would leave this here as a fail-safe
    await page.content();

    // Parse the body text as JSON inside the page and return the result
    const innerText = await page.evaluate(() => {
        return JSON.parse(document.querySelector("body").innerText);
    });

    console.log("innerText now contains the JSON");
    console.log(innerText);

    // I will leave this as an exercise for you to
    //  write out to FS...

    await browser.close(); 

}

run(); 
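
For the file-system step left as an exercise above, here is a minimal sketch using only Node's built-in modules. The helper name `saveJson` and the output file name are my own assumptions, not part of the answer; in the script above you would pass `innerText` instead of the stand-in object.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper: serialise an already-parsed object and write it to disk.
function saveJson(data, filePath) {
  // Pretty-print with two-space indentation so the file is easy to read.
  fs.writeFileSync(filePath, JSON.stringify(data, null, 2), 'utf8');
  return filePath;
}

// Stand-in data; the output location in the temp directory is an assumption.
const outFile = path.join(os.tmpdir(), 'package-copy.json');
saveJson({ name: 'puppeteer' }, outFile);
```

The synchronous write keeps the sketch short; `fs.promises.writeFile` would work just as well inside the `async` function.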
Rippo
3

You can intercept the network response, like this:

const puppeteer = require('puppeteer');
const fs = require('fs');
(async () => {
  const browser = await puppeteer.launch()
  const page = await browser.newPage()
  // Fires for every response the page receives, so each one overwrites the file
  page.on('response', async response => {
    console.log('got response', response.url())
    const data = await response.buffer()
    fs.writeFileSync('/tmp/response.json', data)
  })
  await page.goto('https://raw.githubusercontent.com/GoogleChrome/puppeteer/master/package.json', {waitUntil: 'networkidle0'})
  await browser.close()
})()
Pasi
  • I am getting ` (node:6503) UnhandledPromiseRejectionWarning: Unhandled promise rejection (rejection id: 1): Error: Protocol error (Network.getResponseBody): Target closed. ` – Amy Coin Jan 30 '18 at 15:00
  • Hmm, I get that occasionally, too. Adding `{waitUntil: 'networkidle0'}` seems to help - apparently it was possible to reach `browser.close()` before the whole response body had been loaded. – Pasi Jan 30 '18 at 17:25
  • 1
    Note that you can use `await response.json()` if you want to use the data inside your code. – Nicolai Weitkemper Jun 09 '20 at 13:32
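
Building on the comments above, one way to sidestep the "Target closed" race may be to skip the event listener and read the body from the response that `page.goto()` itself resolves with, awaiting it before the browser is closed. This is only a sketch: `fetchJson` is a hypothetical helper name, and `page` is assumed to be a Puppeteer page created elsewhere with `browser.newPage()`.

```javascript
// Hypothetical helper: `page` is assumed to be a Puppeteer Page created elsewhere.
async function fetchJson(page, url) {
  // page.goto() resolves with the response for the main document (or null).
  const response = await page.goto(url);
  if (response === null || !response.ok()) {
    throw new Error(`Request for ${url} failed`);
  }
  // Read and parse the body here, before the caller closes the browser.
  return response.json();
}
```

You would call it as `const data = await fetchJson(page, url);` and only then `await browser.close();`, so the body has been fully read before the target goes away.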