
Using Puppeteer, I'd like to load a URL in Chrome and capture the following information:

  • request URL
  • request headers
  • request post data
  • response headers text (including duplicate headers like set-cookie)
  • transferred response size (i.e. compressed size)
  • full response body

Capturing the full response body is what causes the problems for me.

Things I've tried:

  • Getting the response content with response.buffer() - this does not work if there are redirects at any point, since buffers are wiped on navigation
  • Intercepting requests and using getResponseBodyForInterception - this means I can no longer access the encodedLength, and I also had problems getting the correct request and response headers in some cases (a sketch of this variant follows below)
  • Using a local proxy works, but this slowed down page load times significantly (and also changed some behavior, e.g. for certificate errors)

Ideally the solution should only have a minor performance impact and have no functional differences from loading a page normally. I would also like to avoid forking Chrome.
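For reference, the interception variant from the list above looks roughly like this (a simplified sketch using the legacy Network.* interception API of the DevTools protocol, not Puppeteer's own page.setRequestInterception; error handling mostly omitted):

'use strict';

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  const client = await page.target().createCDPSession();

  await client.send('Network.enable');
  // Intercept at the HeadersReceived stage so the response body
  // is already available when the event fires.
  await client.send('Network.setRequestInterception', {
    patterns: [{ urlPattern: '*', interceptionStage: 'HeadersReceived' }],
  });

  client.on('Network.requestIntercepted', async ({ interceptionId, request }) => {
    try {
      const { body, base64Encoded } = await client.send(
        'Network.getResponseBodyForInterception',
        { interceptionId }
      );
      const responseBody = base64Encoded ? Buffer.from(body, 'base64') : body;
      // Note: the transferred (encoded) size is not exposed on this
      // path, which is exactly the encodedLength problem above.
      console.log(request.url, responseBody.length);
    } catch (error) {
      console.error(error); // e.g. redirect responses have no body
    }
    await client.send('Network.continueInterceptedRequest', { interceptionId });
  });

  await page.goto('https://example.com/', { waitUntil: 'networkidle0' });
  await browser.close();
})();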

Matt Zeunert
  • Why can't you just write a simple program that sends the request and identifies itself as a Chrome browser? Then you won't have to rely on Chrome, you would just impersonate it. Remember the old days of writing a simple server and browser by hand and sending the request and response packets; it still hasn't changed that much. – Guy Coder Oct 30 '18 at 17:34
  • @GuyCoder Because I'm interested in monitoring the full page load in Chrome, including Ajax calls etc. – Matt Zeunert Oct 30 '18 at 18:04

6 Answers


You can enable request interception with page.setRequestInterception() and then, inside page.on('request'), use the request-promise-native module as a middleman to gather the response data before continuing the request with request.continue() in Puppeteer.

Here's a full working example:

'use strict';

const puppeteer = require('puppeteer');
const request_client = require('request-promise-native');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  const result = [];

  await page.setRequestInterception(true);

  page.on('request', request => {
    // Re-issue the intercepted request out-of-band so the full
    // response body can be captured, mirroring the original request
    // as closely as possible (note: some headers are not available
    // yet at interception time, see the comments below).
    request_client({
      uri: request.url(),
      method: request.method(),
      headers: request.headers(),
      body: request.postData(),
      resolveWithFullResponse: true,
    }).then(response => {
      const request_url = request.url();
      const request_headers = request.headers();
      const request_post_data = request.postData();
      const response_headers = response.headers;
      // Size of the out-of-band response; may be absent for chunked
      // transfers and may differ from what the browser transferred.
      const response_size = response_headers['content-length'];
      const response_body = response.body;

      result.push({
        request_url,
        request_headers,
        request_post_data,
        response_headers,
        response_size,
        response_body,
      });

      console.log(result);
      // continue() lets the browser re-issue the request itself;
      // see the request.respond() sketch below for an alternative.
      request.continue();
    }).catch(error => {
      console.error(error);
      request.abort();
    });
  });

  await page.goto('https://example.com/', {
    waitUntil: 'networkidle0',
  });

  await browser.close();
})();
Grant Miller
  • Was expecting you to write an answer, otherwise I would write the same answer. :D – Md. Abu Taher Oct 27 '18 at 04:14
  • Thanks! This approach breaks some sites because at request interception some headers aren't included yet (e.g. Accept and Cookie). https://github.com/GoogleChrome/puppeteer/issues/3436 I want the outgoing request to have the same headers as without request interception. – Matt Zeunert Oct 27 '18 at 08:09
  • I think `request.continue` will make a new request rather than use the same data, but `request.respond` should work (see the sketch after these comments). – Matt Zeunert Oct 27 '18 at 08:15
  • I tried to manipulate the request URL but it doesn't allow it, and I couldn't see the different URL in the tracing of Chrome. Any ideas on how to do it? – Nisim Joseph Jan 27 '20 at 11:29
  • `request-promise-native` seems to be deprecated as of now. – FelipeKunzler Mar 24 '20 at 19:57
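Following up on the request.respond suggestion in the comments: a minimal sketch (not from the original answer) that fulfills the intercepted request with the out-of-band response instead of letting Chrome re-fetch it. It inherits the same header caveats noted above.

// Sketch only: replace the page.on('request', ...) handler above
// with this variant. request.respond() serves the fetched body
// directly, so Chrome does not issue a second request.
page.on('request', request => {
  request_client({
    uri: request.url(),
    resolveWithFullResponse: true,
    encoding: null, // return the body as a Buffer rather than a string
  }).then(response => {
    request.respond({
      status: response.statusCode,
      headers: response.headers,
      body: response.body,
    });
  }).catch(error => {
    console.error(error);
    request.abort();
  });
});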

Puppeteer-only solution

This can be done with Puppeteer alone. The problem you describe, that response.buffer is cleared on navigation, can be circumvented by processing the requests one after another.

How it works

The code below uses page.setRequestInterception to intercept all requests. If a request is currently being processed or waited for, new requests are put into a queue. response.buffer() can then be used without the risk that other requests asynchronously wipe the buffer, as there are no parallel requests. As soon as the currently processed request/response has been handled, the next request is processed.

Code

const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch();
    const [page] = await browser.pages();

    const results = []; // collects all results

    let paused = false;
    let pausedRequests = [];

    const nextRequest = () => { // continue the next request or "unpause"
        if (pausedRequests.length === 0) {
            paused = false;
        } else {
            // continue first request in "queue"
            (pausedRequests.shift())(); // calls the request.continue function
        }
    };

    await page.setRequestInterception(true);
    page.on('request', request => {
        if (paused) {
            pausedRequests.push(() => request.continue());
        } else {
            paused = true; // pause, as we are processing a request now
            request.continue();
        }
    });

    page.on('requestfinished', async (request) => {
        const response = await request.response();

        const responseHeaders = response.headers();
        let responseBody;
        if (request.redirectChain().length === 0) {
            // body can only be accessed for non-redirect responses
            responseBody = await response.buffer();
        }

        const information = {
            url: request.url(),
            requestHeaders: request.headers(),
            requestPostData: request.postData(),
            responseHeaders: responseHeaders,
            responseSize: responseHeaders['content-length'],
            responseBody,
        };
        results.push(information);

        nextRequest(); // continue with next request
    });
    page.on('requestfailed', (request) => {
        // handle failed request
        nextRequest();
    });

    await page.goto('...', { waitUntil: 'networkidle0' });
    console.log(results);

    await browser.close();
})();
Thomas Dondorf
  • Why do you need to pause requests? Why can't you simply let requests continue, and use the `requestfinished` event to check for the URL and response headers and store those? In my case, all I want are the headers associated with a particular request URL. – onassar Sep 14 '20 at 19:38
  • @onassar Your use case is different to OP's. The question was how to capture "full response data", not just headers. – Thomas Dondorf Sep 15 '20 at 18:13
  • Ahh okay. So if all I care about is the response headers, I could simplify the approach, yah? In my case, I call `setRequestInterception` with `true`, and then call `continue` on request objects in the following events: `request`, `requestfailed` and `requestfinished`. The exception is I store the headers in `requestfinished` event calls. That make sense? – onassar Sep 15 '20 at 19:22
  • @onassar Yes, if you don't need the buffer you can simplify it (a sketch of that variant follows below). – Thomas Dondorf Sep 16 '20 at 16:56
  • What is `[page]` used for? I didn't see it used in your code. – Willis Jan 21 '21 at 17:18
  • @Willis It's a destructuring assignment: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Operators/Destructuring_assignment – Thomas Dondorf Jan 21 '21 at 17:26
  • @ThomasDondorf Got it, sorry, I'm not very familiar with JS. Thank you! – Willis Jan 22 '21 at 03:22
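For readers with the simpler use case from the comment thread, here is a minimal sketch of the headers-only variant (no pausing or queueing, since response.buffer() is never called):

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const [page] = await browser.pages();
  const results = [];

  await page.setRequestInterception(true);
  // Every request is continued immediately: no queue is needed,
  // because there is no buffer that navigation could wipe.
  page.on('request', request => request.continue());
  page.on('requestfinished', request => {
    const response = request.response();
    results.push({
      url: request.url(),
      responseHeaders: response.headers(),
    });
  });

  await page.goto('https://example.com/', { waitUntil: 'networkidle0' });
  console.log(results);

  await browser.close();
})();

If only headers are needed, interception could even be dropped entirely in favor of a plain page.on('response') listener.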

I would suggest searching for a quick proxy server that allows writing request logs together with the actual content.

The target setup is to let the proxy server just write a log file, and then analyze the log, searching for the information you need.

Don't intercept requests while the proxy is working (this will lead to a slowdown).

The performance issues you may encounter (with the proxy-as-logger setup) are mostly related to TLS support; pay attention to allowing a quick TLS handshake and the HTTP/2 protocol in the proxy setup.

E.g. Squid benchmarks show that it is able to process hundreds of requests per second, which should be enough for testing purposes.
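A minimal sketch of pointing Puppeteer at such a logging proxy (the address and port are illustrative; any logging proxy such as Squid would sit there):

const puppeteer = require('puppeteer');

(async () => {
  // Assumes a logging proxy is already running on 127.0.0.1:8080
  // (illustrative address) and writes request/response logs itself.
  const browser = await puppeteer.launch({
    args: [
      '--proxy-server=127.0.0.1:8080',
      // Only needed if the proxy re-signs TLS traffic with its own CA
      // that the browser profile does not trust; note this changes
      // certificate-error behavior, as the question points out.
      '--ignore-certificate-errors',
    ],
  });
  const page = await browser.newPage();
  await page.goto('https://example.com/');
  await browser.close();
})();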

Andrii Muzalevskyi

I would suggest using a tool named Fiddler. It will capture all the information that you mentioned when you load a URL.

ScrapCode

Here's my workaround which I hope will help others.

I had issues with the await page.setRequestInterception(True) command blocking the flow and making the page hang until timeout.

So I added this function:

async def request_interception(req):
    """ await page.setRequestInterception(True) would block the flow, the interception is enabled individually """
    # enable interception
    req.__setattr__('_allowInterception', True)
    if req.url.startswith('http'):
        print(f"\nreq.url: {req.url}")
        print(f"  req.resourceType: {req.resourceType}")
        print(f"  req.method: {req.method}")
        print(f"  req.postData: {req.postData}")
        print(f"  req.headers: {req.headers}")
        print(f"  req.response: {req.response}")
    return await req.continue_()

Then I removed the await page.setRequestInterception(True) call and registered the function above with page.on('request', lambda req: asyncio.ensure_future(request_interception(req))) in my main().

Without the req.__setattr__('_allowInterception', True) statement, Pyppeteer would complain about interception not being enabled for some requests, but with it, this works fine for me.

Just in case someone is interested in the system I'm running Pyppeteer on: Ubuntu 20.04, Python 3.7 (venv)

...
pyee==8.1.0
pyppeteer==0.2.5
python-dateutil==2.8.1
requests==2.25.1
urllib3==1.26.3
websockets==8.1
...

I also posted the solution at https://github.com/pyppeteer/pyppeteer/issues/198

Cheers

Gergely M
  • I think you've confused Puppeteer with Pyppeteer. Puppeteer is for JavaScript, and the Pyppeteer library is just a port from the JavaScript one. – NeuronButter Jun 25 '21 at 11:22
  • Hi @NeuronButter, I'm not confused, just trying to help those who need help with Pyppeteer - which in fact is not the same as Puppeteer. I did that because it's hard to find info for Pyppeteer. Search engines - like Google's - keep returning Puppeteer-related hits. That's how I ended up on this page. Regardless, I take your -1 gracefully, since mine isn't a solution for the OP indeed. – Gergely M Jun 28 '21 at 14:02
  • That actually makes a lot of sense. I can't remove my -1 vote (sorry!), but in the future, try using "pyppeteer" (with the quotes) on Google, so you get an exact search match, and hopefully more relevant results :) – NeuronButter Jul 02 '21 at 03:17
  • Turing bless your soul, experienced colleague! I've been struggling with this for 3 days. – cavalcantelucas Oct 11 '22 at 07:31

Open Chrome, press F12, then go to the "Network" tab. There you can see all the HTTP requests that the website sends, and you'll be able to see the details you mentioned.

Jose Rodriguez