I'm currently trying to scrape retailmenot.com. This is how my code looks so far:

import requests
from collections import OrderedDict

s = requests.session()

s.headers = OrderedDict()
s.headers["Connection"] = "close"
s.headers["Upgrade-Insecure-Requests"] = "1"
s.headers["User-Agent"] = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36"
s.headers["Accept"] = "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9"
s.headers["Sec-Fetch-Site"] = "none"
s.headers["Sec-Fetch-Mode"] = "navigate"
s.headers["Sec-Fetch-Dest"] = "document"
s.headers["Accept-Encoding"] = "gzip, deflate"
s.headers["Accept-Language"] = "en-GB,en-US;q=0.9,en;q=0.8"

s.get("https://www.retailmenot.com/sitemap/A")

When I use this code I instantly get redirected to a CloudFlare page. That said, whenever I pass my traffic through Burp Suite by replacing the last line of my code with this one:

s.get("https://www.retailmenot.com/sitemap/A", proxies = {"https":"https://127.0.0.1:8080"}, verify ="/Users/Downloads/cacert (1).pem")

I get straight to the website. I find this a bit strange and was wondering if anyone could explain why this is happening, and whether there's a way to get similar results with a different certificate (in order to use the Burp Suite certificate I need to keep the app open). Many thanks in advance!

Nazim Kerimbekov
  • It is likely that Burp changes the order of headers, adds some headers or similar and thus bypasses the bot protection. Better compare incoming and outgoing requests. It likely has nothing to do with the certificate. – Steffen Ullrich Nov 23 '20 at 17:23
  • @SteffenUllrich Thanks for the reply. I know the order of the headers that BurpSuite is sending, which is why I'm using `OrderedDict()`. The crazy thing is that I tried quite a lot of things and it seems that the only thing that works is to use the BurpSuite certificate. Do you have any idea what's going on? – Nazim Kerimbekov Nov 23 '20 at 18:06
  • Hard to tell. Maybe it is the TLS fingerprint then. By using BurpSuite the TLS connection is between BurpSuite and the server and thus it uses the properties of the TLS configuration there. – Steffen Ullrich Nov 23 '20 at 18:29
  • @SteffenUllrich Thank you very much for your reply. I think it might be because of the TLS fingerprint. Do you know if there are any other certificates I could use? – Nazim Kerimbekov Nov 23 '20 at 21:50
  • TLS fingerprint is completely unrelated to the certificates used. – Steffen Ullrich Nov 23 '20 at 22:11
  • @SteffenUllrich Oh I see, is there any way to add fingerprints to python requests? – Nazim Kerimbekov Nov 24 '20 at 07:21
  • Fingerprints depend on the TLS stack, ciphers used etc. There is no "set exactly this fingerprint". – Steffen Ullrich Nov 24 '20 at 11:47
  • Since Python 3.7 a standard `dict` is guaranteed to remember insertion order so that using an `OrderedDict` becomes unnecessary if that is your primary concern. – Booboo Nov 26 '20 at 12:05
  • @Booboo Thanks for your reply! That said, the problem is that I get different outputs in Burp Suite Repeater and Python. Any idea why this is happening? – Nazim Kerimbekov Nov 26 '20 at 14:54
  • @NazimKerimbekov No, I was just mentioning that as an aside. However, you have made *two* changes in the variation that works for you: (1) You have specified a proxy and (2) you have specified what I assume is not the standard .pem file. You might want to try just making one of these changes one at a time to see which one makes the difference, if any. I somehow doubt that the standard .pem file could have been the issue. – Booboo Nov 26 '20 at 15:28
  • @Booboo: The non-standard CA file is needed to access the SSL intercepting proxy without validation errors, i.e. this is the CA used to issue the new certificates by the proxy. It will fail when using the CA and not using the proxy and also if using the proxy and not using this CA. – Steffen Ullrich Nov 26 '20 at 20:23

1 Answer

It looks like the problem is the underlying client-side TLS behavior.

I have an older version of Python using OpenSSL 1.1.1b and a newer one using OpenSSL 1.1.1f. It fails with the first version but works with the second version. This would also explain why it works with Burp: it uses a slightly different TLS behavior.
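Which TLS library a given interpreter is linked against can be read from Python itself. The following is a minimal sketch; the helper name `tls_library_info` is hypothetical, and only `ssl.OPENSSL_VERSION` comes from the standard library. It reports the library, it does not predict whether the site will answer with 200 or 403.

```python
import ssl


def tls_library_info():
    """Describe the TLS library this interpreter is linked against.

    Returns a (name, version_string, is_libressl) tuple. The function name
    is hypothetical; only the `ssl` module attribute is real.
    """
    version_string = ssl.OPENSSL_VERSION      # full version string of the linked library
    name = version_string.split()[0]          # first token, e.g. "OpenSSL" or "LibreSSL"
    is_libressl = name.lower() == "libressl"
    return name, version_string, is_libressl


name, version_string, is_libressl = tls_library_info()
print(f"Linked TLS library: {version_string}")
if is_libressl:
    print("LibreSSL build: TLS defaults will differ from an OpenSSL 1.1.1 build.")
```

Running this in both the working and the non-working interpreter is a quick way to confirm that they really use different TLS builds before comparing anything else.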

I've tried to track the problem down: making the non-working version use the ciphers of the working version does not help. The other main difference is the set of supported signature algorithms. And indeed, with the mentioned OpenSSL 1.1.1b (but also with newer versions shipped with Anaconda Python) the difference can be reduced to sigalgs:

 $ openssl s_client -connect www.retailmenot.com:443 -crlf
 ...[various output]...
 <paste the expected HTTP request>
 ...
 HTTP/1.1 403 Forbidden

 $ openssl s_client -connect www.retailmenot.com:443 -crlf -sigalgs 'ECDSA+SHA256'
 ...[various output]...
 <paste the expected HTTP request>
 ...
 HTTP/1.1 200 OK

Unfortunately I can see no way in Python requests to directly set the signature algorithms in the TLS stack. The API is not exposed and it simply uses the default - and thus fails or succeeds depending on how OpenSSL was built.

But it looks like it is possible to indirectly set the value by specifying a different security level:

import requests
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.ssl_ import create_urllib3_context

# Default cipher list, but with the OpenSSL security level set to 2.
# The security level indirectly changes the signature algorithms offered.
CIPHERS = 'DEFAULT:@SECLEVEL=2'

class CipherAdapter(HTTPAdapter):
    """Transport adapter that builds its TLS context from CIPHERS."""

    def init_poolmanager(self, *args, **kwargs):
        # Used for direct connections.
        kwargs['ssl_context'] = create_urllib3_context(ciphers=CIPHERS)
        return super().init_poolmanager(*args, **kwargs)

    def proxy_manager_for(self, *args, **kwargs):
        # Used when the request goes through a proxy.
        kwargs['ssl_context'] = create_urllib3_context(ciphers=CIPHERS)
        return super().proxy_manager_for(*args, **kwargs)

s = requests.session()
# Only requests to this URL prefix use the custom adapter.
s.mount('https://www.retailmenot.com/', CipherAdapter())
...
print(s.get("https://www.retailmenot.com/sitemap/A"))

This, together with the specific header settings, results in my tests in <Response [200]> whereas with the same Python version and without the changed security level it results in <Response [403]>.
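Since the workaround depends on the TLS library understanding the `@SECLEVEL` directive, it can help to probe this locally before blaming the server. The sketch below is an assumption-laden check, not a guarantee: the helper `cipher_string_accepted` is hypothetical, it only tests whether `SSLContext.set_ciphers` accepts the string, and a library that silently ignores the directive would still report success.

```python
import ssl


def cipher_string_accepted(cipher_string):
    """Return True if the linked TLS library accepts `cipher_string`.

    Hypothetical helper: it only checks that the string selects at least
    one cipher locally; it does not contact any server.
    """
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    try:
        context.set_ciphers(cipher_string)
    except ssl.SSLError:
        # Raised when no cipher can be selected from the given string.
        return False
    return True


print(ssl.OPENSSL_VERSION)
print("SECLEVEL string accepted:", cipher_string_accepted("DEFAULT:@SECLEVEL=2"))
```

If this prints `False`, the adapter above cannot work with that interpreter, and switching to a Python built against OpenSSL 1.1.1 is the more promising route.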

Steffen Ullrich
  • Hey Steffen. Thank you very much for your answer! Is there a way to use some other CA in Python requests in order to make it work? If not, is there some kind of workaround, or is it something that just can't be done with Python requests? – Nazim Kerimbekov Nov 27 '20 at 08:39
  • @NazimKerimbekov: This is totally unrelated to the CA used. The CA is only used locally by the client to verify the server certificate. It does not affect what the TLS handshake looks like and does not affect the answer by the server. And again, it works for me with one build of Python (Ubuntu 20.04), so it is possible to do with Python, depending on how the OpenSSL used in Python was built. – Steffen Ullrich Nov 27 '20 at 09:37
  • Thank you very much for your reply Steffen. I'm running Python 3.9 on an iMac. Should I update my python version? Change computers? – Nazim Kerimbekov Nov 28 '20 at 10:54
  • @NazimKerimbekov: It has nothing to do with the version of Python. Since Python does not offer the API to set the signature algorithms it fully depends on how OpenSSL (which is used by Python) was compiled. So it will not help to just update your Python. And again, it works with Python on Ubuntu 20.04, so changing computers to this (or installing as VM on the Mac) will help. – Steffen Ullrich Nov 28 '20 at 11:15
  • @SteffenUllrich might the solution under "Asymmetric key algorithms (RSA and ECDSA)" described [here](https://pypi.org/project/requests-http-signature/) work? – reverse_engineer Nov 29 '20 at 07:40
  • @reverse_engineer: This is something different. What is needed is manipulation of the signature algorithm extension in the TLS ClientHello. This could be done with `-sigalgs` in `openssl s_client` or with [SSL_CTX_set1_sigalgs_list](https://www.openssl.org/docs/manmaster/man3/SSL_CTX_set1_sigalgs_list.html) in C. Unfortunately this API is not exposed in Python. – Steffen Ullrich Nov 29 '20 at 08:05
  • @SteffenUllrich OK, I see, sorry I didn't get the exact problem. I think that's managed in the ssl package of python. I suspect there is a way to do this, would seem strange that Python can't handle this, we just have to go in the lower-level packages. – reverse_engineer Nov 29 '20 at 08:29
  • @reverse_engineer: I think I have a solution by setting the security level through the ciphers. Works at least for me. See updated answer. – Steffen Ullrich Nov 29 '20 at 08:31
  • @SteffenUllrich Cool, seems to work without a too nasty workaround! – reverse_engineer Nov 29 '20 at 08:37
  • @SteffenUllrich Thank you very much for updating your answer! Strangely enough, I'm still getting status code 403. – Nazim Kerimbekov Nov 30 '20 at 12:25
  • @NazimKerimbekov: What version of OpenSSL are you using in Python, i.e. `python -c 'import ssl; print(ssl.OPENSSL_VERSION)'`? – Steffen Ullrich Nov 30 '20 at 14:47
  • @SteffenUllrich LibreSSL 2.8.3 :) – Nazim Kerimbekov Nov 30 '20 at 21:57
  • @NazimKerimbekov: I fear that will not work. This also does not support SECLEVEL. Better use a Python compiled with OpenSSL 1.1.1. – Steffen Ullrich Nov 30 '20 at 22:19
  • Any idea how I'd go about including proxies in this? (When I add `proxies=` to the `s.get` call, it gives me the same initial problem.) – Nazim Kerimbekov Oct 30 '21 at 10:06
  • Hi @SteffenUllrich, I just changed computers and it seems that this isn't working anymore (I'm on a MacBook Pro with the M1 chip). Your solution saved me a crazy amount of time during these past two years and I was wondering if you knew what's going on now? Many thanks!!! – Nazim Kerimbekov Sep 11 '22 at 17:36
  • @NazimKerimbekov: I recommend that you ask a new question with all necessary details, like version of Python, version of OpenSSL and a minimal code to reproduce the problem. – Steffen Ullrich Sep 11 '22 at 18:08