I am building a website recorder that acts like a proxy, in order to test web scrapers on an ongoing basis. It is split into three Docker containers, all on GNU/Linux: (1) a proxy, (2) an API and request queue, and (3) a simple web app.
It works fine for HTTP sites: I click a button in the web app, which makes a request to the API container; that adds a job to an internal request queue, and the queue worker then requests the site via the proxy. The proxy records the site as it passes through.
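For context, the working HTTP flow boils down to something like the following sketch; the container names, port and endpoint path here are invented purely for illustration:

```sh
# Hypothetical sketch of the existing HTTP flow (names/paths invented).
# The web app button results in a request like this to the API container,
# which adds the URL to the internal queue:
curl -s -X POST http://recorder-api:8080/queue -d 'url=http://example.com/'

# The queue worker then effectively does this, so the fetch passes
# through, and is recorded by, the proxy container:
http_proxy=http://recorder-proxy:3128 wget -qO /dev/null http://example.com/
```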
However, I'd forgotten that HTTPS traffic cannot simply be recorded as it passes through, and now that I've come to implement this, I've found that proxies just handle the CONNECT verb and then act as an opaque relay between the client and the target. I believe I cannot replay the same data chunks either, since part of the encryption uses a randomised, throwaway symmetric key (I do have a script suitable for testing this, though, so I will try it anyway, just for the educational value!).
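To illustrate: for an HTTPS fetch, all the proxy ever sees in the clear is an exchange along these lines, after which everything is an opaque, encrypted byte-stream it merely shuttles back and forth:

```
CONNECT example.com:443 HTTP/1.1
Host: example.com:443

HTTP/1.1 200 Connection established
```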
So, I was wondering: could my fetching client give up enough secrets for the proxy system to decode the recorded byte-stream? I am using Wget to do the fetch, which I guess uses OpenSSL (or perhaps GnuTLS, depending on the build). It does not need to be Wget, though: if I were using a PHP script with file_get_contents and a stream context, could I ask the openssl module for the decryption keys?
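To make the question concrete, here is roughly the kind of thing I am imagining, assuming the client honours the NSS SSLKEYLOGFILE convention (curl does when built against OpenSSL 1.1.1+ or a recent GnuTLS; I am not sure whether my Wget build does) and that the proxy keeps a raw capture it can decrypt offline:

```sh
# Sketch only: assumes the fetching client honours SSLKEYLOGFILE and
# that the proxy container has kept a raw packet capture of the tunnel.
export SSLKEYLOGFILE=/tmp/tls-keys.log
curl -s -x http://recorder-proxy:3128 https://example.com/ -o /dev/null

# The NSS-format key log could then be handed to the proxy container,
# which decrypts its recorded byte-stream offline with tshark:
tshark -r /captures/example.pcap \
  -o tls.keylog_file:/tmp/tls-keys.log \
  -Y 'http || http2'
```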
(To be fair, I will probably not solve the problem this way even if it is possible; I just thought it would be a really interesting way to learn a bit more about TLS. In practice, I will record a "null" entry against every secure website in the proxy and require the requesting service to notify the proxy of the header/body data via an API call, so it can be played back later. The requesting services will, of course, have plaintext copies of these items.)
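A rough sketch of that fallback, with an invented endpoint and payload shape, would be the requesting service doing something like:

```sh
# Sketch of the fallback plan; endpoint name and JSON shape are invented.
# The requesting service fetches the secure site directly, then posts the
# plaintext to the recorder's API so it can be played back later.
body=$(wget -qO- https://secure.example.com/page)
jq -n --arg url 'https://secure.example.com/page' --arg body "$body" \
      '{url: $url, status: 200, headers: {}, body: $body}' |
  curl -s -X POST http://recorder-api:8080/recordings \
       -H 'Content-Type: application/json' -d @-
```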