
Using Puppeteer, I'd like to load a URL in Chrome and capture the following information:

  • request URL
  • request headers
  • request post data
  • response headers text (including duplicate headers like set-cookie)
  • transferred response size (i.e. compressed size)
  • full response body

Capturing the full response body is what causes the problems for me.

Things I've tried:

  • Getting the response content with response.buffer() - this does not work if there are redirects at any point, since buffers are wiped on navigation
  • Intercepting requests and using getResponseBodyForInterception - this means I can no longer access the encodedLength, and I also had problems getting the correct request and response headers in some cases (a sketch of this variant follows below)
  • Using a local proxy works, but this slowed down page load times significantly (and also changed some behavior, e.g. for certificate errors)

Ideally the solution should only have a minor performance impact and have no functional differences from loading a page normally. I would also like to avoid forking Chrome.
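For reference, the interception variant from the list above looks roughly like this (a simplified sketch using the legacy Network.* interception API of the DevTools protocol, not Puppeteer's own page.setRequestInterception; error handling mostly omitted):

'use strict';

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  const client = await page.target().createCDPSession();

  await client.send('Network.enable');
  // Intercept at the HeadersReceived stage so the response body
  // is already available when the event fires.
  await client.send('Network.setRequestInterception', {
    patterns: [{ urlPattern: '*', interceptionStage: 'HeadersReceived' }],
  });

  client.on('Network.requestIntercepted', async ({ interceptionId, request }) => {
    try {
      const { body, base64Encoded } = await client.send(
        'Network.getResponseBodyForInterception',
        { interceptionId }
      );
      const responseBody = base64Encoded ? Buffer.from(body, 'base64') : body;
      // Note: the transferred (encoded) size is not exposed on this
      // path, which is exactly the encodedLength problem above.
      console.log(request.url, responseBody.length);
    } catch (error) {
      console.error(error); // e.g. redirect responses have no body
    }
    await client.send('Network.continueInterceptedRequest', { interceptionId });
  });

  await page.goto('https://example.com/', { waitUntil: 'networkidle0' });
  await browser.close();
})();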

Matt Zeunert
  • Why can't you just write a simple program that sends the request and identifies itself as a Chrome browser? Then you won't have to rely on Chrome, you would just impersonate it. Remember the old days of writing a simple server and browser by hand and sending the request and response packets; it still hasn't changed that much. – Guy Coder Oct 30 '18 at 17:34
  • @GuyCoder Because I'm interested in monitoring the full page load in Chrome, including Ajax calls etc. – Matt Zeunert Oct 30 '18 at 18:04

6 Answers


You can enable request interception with page.setRequestInterception() and then, inside page.on('request'), use the request-promise-native module as a middleman to gather the response data before continuing the request with request.continue() in Puppeteer.

Here's a full working example:

'use strict';

const puppeteer = require('puppeteer');
const request_client = require('request-promise-native');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  const result = [];

  await page.setRequestInterception(true);

  page.on('request', request => {
    // Re-issue the intercepted request out-of-band so the full
    // response body can be captured, mirroring the original request
    // as closely as possible (note: some headers are not available
    // yet at interception time, see the comments below).
    request_client({
      uri: request.url(),
      method: request.method(),
      headers: request.headers(),
      body: request.postData(),
      resolveWithFullResponse: true,
    }).then(response => {
      const request_url = request.url();
      const request_headers = request.headers();
      const request_post_data = request.postData();
      const response_headers = response.headers;
      // Size of the out-of-band response; may be absent for chunked
      // transfers and may differ from what the browser transferred.
      const response_size = response_headers['content-length'];
      const response_body = response.body;

      result.push({
        request_url,
        request_headers,
        request_post_data,
        response_headers,
        response_size,
        response_body,
      });

      console.log(result);
      // continue() lets the browser re-issue the request itself;
      // see the request.respond() sketch below for an alternative.
      request.continue();
    }).catch(error => {
      console.error(error);
      request.abort();
    });
  });

  await page.goto('https://example.com/', {
    waitUntil: 'networkidle0',
  });

  await browser.close();
})();
Grant Miller
  • Was expecting you to write an answer, otherwise I would write the same answer. :D – Md. Abu Taher Oct 27 '18 at 04:14
  • Thanks! This approach breaks some sites because at request interception some headers aren't included yet (e.g. Accept and Cookie). https://github.com/GoogleChrome/puppeteer/issues/3436 I want the outgoing request to have the same headers as without request interception. – Matt Zeunert Oct 27 '18 at 08:09
  • I think `request.continue` will make a new request rather than use the same data, but `request.respond` should work (see the sketch after these comments). – Matt Zeunert Oct 27 '18 at 08:15
  • I tried to manipulate the request URL but it doesn't allow it, and I couldn't see the different URL in the tracing of Chrome. Any ideas on how to do it? – Nisim Joseph Jan 27 '20 at 11:29
  • `request-promise-native` seems to be deprecated as of now. – FelipeKunzler Mar 24 '20 at 19:57
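Following up on the request.respond suggestion in the comments: a minimal sketch (not from the original answer) that fulfills the intercepted request with the out-of-band response instead of letting Chrome re-fetch it. It inherits the same header caveats noted above.

// Sketch only: replace the page.on('request', ...) handler above
// with this variant. request.respond() serves the fetched body
// directly, so Chrome does not issue a second request.
page.on('request', request => {
  request_client({
    uri: request.url(),
    resolveWithFullResponse: true,
    encoding: null, // return the body as a Buffer rather than a string
  }).then(response => {
    request.respond({
      status: response.statusCode,
      headers: response.headers,
      body: response.body,
    });
  }).catch(error => {
    console.error(error);
    request.abort();
  });
});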

Puppeteer-only solution

This can be done with Puppeteer alone. The problem you describe, that response.buffer is cleared on navigation, can be circumvented by processing the requests one after another.

How it works

The code below uses page.setRequestInterception to intercept all requests. If a request is currently being processed or waited for, new requests are put into a queue. response.buffer() can then be used without the risk that other requests asynchronously wipe the buffer, as there are no parallel requests. As soon as the currently processed request/response has been handled, the next request is processed.

Code

const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch();
    const [page] = await browser.pages();

    const results = []; // collects all results

    let paused = false;
    let pausedRequests = [];

    const nextRequest = () => { // continue the next request or "unpause"
        if (pausedRequests.length === 0) {
            paused = false;
        } else {
            // continue first request in "queue"
            (pausedRequests.shift())(); // calls the request.continue function
        }
    };

    await page.setRequestInterception(true);
    page.on('request', request => {
        if (paused) {
            pausedRequests.push(() => request.continue());
        } else {
            paused = true; // pause, as we are processing a request now
            request.continue();
        }
    });

    page.on('requestfinished', async (request) => {
        const response = await request.response();

        const responseHeaders = response.headers();
        let responseBody;
        if (request.redirectChain().length === 0) {
            // body can only be accessed for non-redirect responses
            responseBody = await response.buffer();
        }

        const information = {
            url: request.url(),
            requestHeaders: request.headers(),
            requestPostData: request.postData(),
            responseHeaders: responseHeaders,
            responseSize: responseHeaders['content-length'],
            responseBody,
        };
        results.push(information);

        nextRequest(); // continue with next request
    });
    page.on('requestfailed', (request) => {
        // handle failed request
        nextRequest();
    });

    await page.goto('...', { waitUntil: 'networkidle0' });
    console.log(results);

    await browser.close();
})();
Thomas Dondorf
  • Why do you need to pause requests? Why can't you simply let requests continue, and use the `requestfinished` event to check for the URL and response headers and store those? In my case, all I want are the headers associated with a particular request URL. – onassar Sep 14 '20 at 19:38
  • @onassar Your use case is different to OP's. The question was how to capture "full response data", not just headers. – Thomas Dondorf Sep 15 '20 at 18:13
  • Ahh okay. So if all I care about is the response headers, I could simplify the approach, yah? In my case, I call `setRequestInterception` with `true`, and then call `continue` on request objects in the following events: `request`, `requestfailed` and `requestfinished`. The exception is I store the headers in `requestfinished` event calls. That make sense? – onassar Sep 15 '20 at 19:22
  • @onassar Yes, if you don't need the buffer you can simplify it (a sketch of that variant follows below). – Thomas Dondorf Sep 16 '20 at 16:56
  • What is `[page]` used for? I didn't see it used in your code. – Willis Jan 21 '21 at 17:18
  • @Willis It's a destructuring assignment: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Operators/Destructuring_assignment – Thomas Dondorf Jan 21 '21 at 17:26
  • @ThomasDondorf Got it, sorry, I'm not very familiar with JS. Thank you! – Willis Jan 22 '21 at 03:22
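For readers with the simpler use case from the comment thread, here is a minimal sketch of the headers-only variant (no pausing or queueing, since response.buffer() is never called):

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const [page] = await browser.pages();
  const results = [];

  await page.setRequestInterception(true);
  // Every request is continued immediately: no queue is needed,
  // because there is no buffer that navigation could wipe.
  page.on('request', request => request.continue());
  page.on('requestfinished', request => {
    const response = request.response();
    results.push({
      url: request.url(),
      responseHeaders: response.headers(),
    });
  });

  await page.goto('https://example.com/', { waitUntil: 'networkidle0' });
  console.log(results);

  await browser.close();
})();

If only headers are needed, interception could even be dropped entirely in favor of a plain page.on('response') listener.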

I would suggest searching for a quick proxy server that allows writing request logs together with the actual content.

The target setup is to let the proxy server just write a log file, and then analyze the log, searching for the information you need.

Don't intercept requests while the proxy is working (this will lead to a slowdown).

The performance issues you may encounter (with the proxy-as-logger setup) are mostly related to TLS support; pay attention to allowing a quick TLS handshake and the HTTP/2 protocol in the proxy setup.

E.g. Squid benchmarks show that it is able to process hundreds of requests per second, which should be enough for testing purposes.
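A minimal sketch of pointing Puppeteer at such a logging proxy (the address and port are illustrative; any logging proxy such as Squid would sit there):

const puppeteer = require('puppeteer');

(async () => {
  // Assumes a logging proxy is already running on 127.0.0.1:8080
  // (illustrative address) and writes request/response logs itself.
  const browser = await puppeteer.launch({
    args: [
      '--proxy-server=127.0.0.1:8080',
      // Only needed if the proxy re-signs TLS traffic with its own CA
      // that the browser profile does not trust; note this changes
      // certificate-error behavior, as the question points out.
      '--ignore-certificate-errors',
    ],
  });
  const page = await browser.newPage();
  await page.goto('https://example.com/');
  await browser.close();
})();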

Andrii Muzalevskyi

I would suggest using a tool named Fiddler. It will capture all the information that you mentioned when you load a URL.

ScrapCode

Here's my workaround which I hope will help others.

I had issues with the await page.setRequestInterception(True) command blocking the flow and making the page hang until timeout.

So I added this function:

async def request_interception(req):
    """ await page.setRequestInterception(True) would block the flow, the interception is enabled individually """
    # enable interception
    req.__setattr__('_allowInterception', True)
    if req.url.startswith('http'):
        print(f"\nreq.url: {req.url}")
        print(f"  req.resourceType: {req.resourceType}")
        print(f"  req.method: {req.method}")
        print(f"  req.postData: {req.postData}")
        print(f"  req.headers: {req.headers}")
        print(f"  req.response: {req.response}")
    return await req.continue_()

Then I removed the await page.setRequestInterception(True) call and registered the function above with page.on('request', lambda req: asyncio.ensure_future(request_interception(req))) in my main().

Without the req.__setattr__('_allowInterception', True) statement, Pyppeteer would complain about interception not being enabled for some requests, but with it, this works fine for me.

Just in case someone is interested in the system I'm running Pyppeteer on: Ubuntu 20.04, Python 3.7 (venv)

...
pyee==8.1.0
pyppeteer==0.2.5
python-dateutil==2.8.1
requests==2.25.1
urllib3==1.26.3
websockets==8.1
...

I also posted the solution at https://github.com/pyppeteer/pyppeteer/issues/198

Cheers

Gergely M
  • I think you've confused Puppeteer with Pyppeteer. Puppeteer is for JavaScript, and the Pyppeteer library is just a port from the JavaScript one. – NeuronButter Jun 25 '21 at 11:22
  • Hi @NeuronButter, I'm not confused, just trying to help those who need help with Pyppeteer - which in fact is not the same as Puppeteer. I did that because it's hard to find info for Pyppeteer. Search engines - like Google's - keep returning Puppeteer-related hits. That's how I ended up on this page. Regardless, I take your -1 gracefully, since mine isn't a solution for the OP indeed. – Gergely M Jun 28 '21 at 14:02
  • That actually makes a lot of sense. I can't remove my -1 vote (sorry!), but in the future, try using "pyppeteer" (with the quotes) on Google, so you get an exact search match, and hopefully more relevant results :) – NeuronButter Jul 02 '21 at 03:17
  • Turing bless your soul, experienced colleague! I've been struggling with this for 3 days. – cavalcantelucas Oct 11 '22 at 07:31

Open Chrome, press F12, then go to the "Network" tab. There you can see all the HTTP requests that the website sends, and you'll be able to see the details you mentioned.

Jose Rodriguez