
I need to retrieve this site (www.extra.com.br) in a Python project.

Neither

page = requests.get("https://www.extra.com.br/")

nor

page = requests.get("http://www.extra.com.br/")

works; the script hangs on this line.
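A hang like this is consistent with no timeout being set, since requests waits indefinitely by default. The following is only a minimal sketch of a call that fails fast instead of blocking; the helper name `fetch` and the timeout values are assumptions for illustration, not part of the original script.

```python
import requests


def fetch(url, timeout=(5, 20)):
    """Return the response, or None if the request fails or times out.

    `fetch` is a hypothetical helper; the (connect, read) timeout values
    in seconds are arbitrary choices for this sketch.
    """
    try:
        # Without a timeout, requests.get() can wait forever for the server.
        response = requests.get(url, timeout=timeout)
        # Turn 4xx/5xx status codes into exceptions as well.
        response.raise_for_status()
        return response
    except requests.exceptions.RequestException as exc:
        # Covers timeouts, connection resets, invalid URLs and HTTP errors.
        print(f"Request to {url} failed: {exc!r}")
        return None


if __name__ == "__main__":
    page = fetch("https://www.extra.com.br/")
    print("no response" if page is None else page.status_code)
```

This does not make the site respond; it only turns the silent hang into a visible error that can be handled.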

So I tried using curl to retrieve the page and inspect it offline.

curl http://www.extra.com.br does not work either. This is the output with -v:

  • STATE: INIT => CONNECT handle 0x6dfd90; line 1392 (connection #-5000)
  • Rebuilt URL to: https://www.extra.com.br/
  • Added connection 0. The cache now contains 1 members
  • STATE: CONNECT => WAITRESOLVE handle 0x6dfd90; line 1428 (connection #0)
    % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                   Dload  Upload   Total   Spent    Left  Speed
    0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  • Trying 23.74.196.194...
  • TCP_NODELAY set
  • STATE: WAITRESOLVE => WAITCONNECT handle 0x6dfd90; line 1509 (connection #0)
  • Connected to www.extra.com.br (23.74.196.194) port 443 (#0)
  • STATE: WAITCONNECT => SENDPROTOCONNECT handle 0x6dfd90; line 1561 (connection #0)
    0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  • Marked for [keep alive]: HTTP default
  • ALPN, offering h2
  • ALPN, offering http/1.1
  • Cipher selection: ALL:!EXPORT:!EXPORT40:!EXPORT56:!aNULL:!LOW:!RC4:@STRENGTH
  • successfully set certificate verify locations:
  • CAfile: C:/Program Files/Git/mingw64/ssl/certs/ca-bundle.crt CApath: none
  • TLSv1.2 (OUT), TLS header, Certificate Status (22): } [5 bytes data]
  • TLSv1.2 (OUT), TLS handshake, Client hello (1): } [512 bytes data]
  • STATE: SENDPROTOCONNECT => PROTOCONNECT handle 0x6dfd90; line 1575 (connection #0) { [5 bytes data]
  • TLSv1.2 (IN), TLS handshake, Server hello (2): { [108 bytes data]
  • TLSv1.2 (IN), TLS handshake, Certificate (11): { [2799 bytes data]
  • TLSv1.2 (IN), TLS handshake, Server key exchange (12): { [333 bytes data]
  • TLSv1.2 (IN), TLS handshake, Server finished (14): { [4 bytes data]
  • TLSv1.2 (OUT), TLS handshake, Client key exchange (16): } [70 bytes data]
  • TLSv1.2 (OUT), TLS change cipher, Client hello (1): } [1 bytes data]
  • TLSv1.2 (OUT), TLS handshake, Finished (20): } [16 bytes data]
  • TLSv1.2 (IN), TLS change cipher, Client hello (1): { [1 bytes data]
  • TLSv1.2 (IN), TLS handshake, Finished (20): { [16 bytes data]
  • SSL connection using TLSv1.2 / ECDHE-RSA-AES256-GCM-SHA384
  • ALPN, server accepted to use http/1.1
  • Server certificate:
  • subject: C=BR; ST=Sao Paulo; L=Sao Paulo; O=CNOVA Comercio Eletronico S.A.; OU=TI; CN=*.extra.com.br
  • start date: Jul 17 00:00:00 2018 GMT
  • expire date: Jul 17 12:00:00 2019 GMT
  • subjectAltName: host "www.extra.com.br" matched cert's "*.extra.com.br"
  • issuer: C=US; O=DigiCert Inc; CN=DigiCert SHA2 Secure Server CA
  • SSL certificate verify ok.
  • STATE: PROTOCONNECT => DO handle 0x6dfd90; line 1596 (connection #0) } [5 bytes data]

    GET / HTTP/1.1
    Host: www.extra.com.br
    User-Agent: curl/7.58.0
    Accept: */*

  • STATE: DO => DO_DONE handle 0x6dfd90; line 1658 (connection #0)
  • STATE: DO_DONE => WAITPERFORM handle 0x6dfd90; line 1783 (connection #0)
  • STATE: WAITPERFORM => PERFORM handle 0x6dfd90; line 1799 (connection #0)
    0     0    0     0    0     0      0      0 --:--:--  0:00:19 --:--:--     0
  • OpenSSL SSL_read: SSL_ERROR_SYSCALL, errno 10054
  • Marked for [closure]: Transfer returned error
  • multi_done
  • stopped the pause stream!
    0     0    0     0    0     0      0      0 --:--:--  0:00:19 --:--:--     0
  • Closing connection 0
  • The cache now contains 0 members } [5 bytes data]

curl: (56) OpenSSL SSL_read: SSL_ERROR_SYSCALL, errno 10054

Thadeu Melo
  • In short: the site is protected by Akamai bot detection. One needs to set specific HTTP headers correctly to work around this detection. – Steffen Ullrich Mar 20 '19 at 05:38
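
The comment above can be sketched with requests as follows. This is only an illustration of sending browser-like headers: the header values in `BROWSER_HEADERS`, and the names `make_session` and `fetch_with_headers`, are assumptions. The comment does not say which headers the detection actually checks, so this set may or may not be sufficient.

```python
import requests

# Hypothetical browser-like headers. Which headers (and which order or
# values) the bot detection inspects is NOT stated in the comment above;
# these are just typical values a desktop browser would send.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "pt-BR,pt;q=0.9,en;q=0.8",
    "Accept-Encoding": "gzip, deflate",
    "Connection": "keep-alive",
}


def make_session(headers=BROWSER_HEADERS):
    """Create a Session whose default headers are replaced by `headers`."""
    session = requests.Session()
    # update() overrides the defaults requests would send, such as
    # "User-Agent: python-requests/x.y.z".
    session.headers.update(headers)
    return session


def fetch_with_headers(url, timeout=(5, 20)):
    """GET `url` with the browser-like headers and a (connect, read) timeout."""
    with make_session() as session:
        return session.get(url, timeout=timeout)


if __name__ == "__main__":
    try:
        page = fetch_with_headers("https://www.extra.com.br/")
        print(page.status_code)
    except requests.exceptions.RequestException as exc:
        print(f"Request failed: {exc!r}")
```

Using a `Session` keeps the headers and any cookies consistent across requests, which is closer to how a browser behaves than independent `requests.get` calls.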
