scraping python requests soraredata

Question

Hello, I am trying to retrieve the JSON from soraredata at this link, but it returns HTML source code with no JSON in it. When I put the same link into a program called Insomnia, I do get the JSON, so I think it must be possible with requests? Sorry for my English, I am using a translator.

Edit: the link seems to work without the "my_username" part, so url = "https://www.soraredata.com/api/stats/newFullRankings/all/false/all/7/0/sr_football"

I get a status code 403. I don't know what is missing to get 200.

Thank you

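import requests

headers = {
    "Host": "www.soraredata.com",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101 Firefox/102.0",
    "Referer": "https://www.soraredata.com/rankings",
}

# Original URL, with the username segment:
#url = "https://www.soraredata.com/api/stats/newFullRankings/all/false/all/7/{my_username}/0/sr_football"
# URL from the edit above, without the username segment:
url = "https://www.soraredata.com/api/stats/newFullRankings/all/false/all/7/0/sr_football"

res = requests.get(url, headers=headers)
html = res.text
#html = json.loads(html)

print(html)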

403 means authorization is needed. Is that a public API that you can use freely? — Ken Hung, Jul 16 '22 at 04:02
I don't know at all; unfortunately I don't know much about this. I have the impression that the returned source code mentions a captcha, but I have never seen one on the website — Maxime Lara, Jul 16 '22 at 04:23
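
For context on why the commented-out json.loads line in the question cannot work here: the 403 body is an HTML page (the asker mentions it talks about a captcha), not JSON. Below is a minimal sketch of checking a response before parsing it. The function name parse_json_response is hypothetical and not part of requests; it only looks at values that a requests response already exposes.

```python
import json


def parse_json_response(status_code, content_type, body):
    """Return the parsed JSON body, or None if the response does not look like JSON.

    Hypothetical helper for illustration; it is not part of requests.
    """
    # Anything other than 200 (for example the 403 in the question) is not parsed.
    if status_code != 200:
        return None
    # An HTML challenge page is served as text/html, not application/json.
    if "json" not in (content_type or "").lower():
        return None
    try:
        return json.loads(body)
    except json.JSONDecodeError:
        return None
```

With the question's code this would be called as parse_json_response(res.status_code, res.headers.get("Content-Type"), res.text), which returns None for the 403 page instead of raising.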

score 1 · Accepted Answer · answered Jul 16 '22 at 05:23

Here is a solution I got to work.

import http.client
import json
import socket
import ssl
import urllib.request

hostname = "www.soraredata.com"
path = "/api/stats/newFullRankings/all/false/all/7/0/sr_football"
http_msg = "GET {path} HTTP/1.1\r\nHost: {host}\r\nAccept-Encoding: identity\r\nUser-Agent: python-urllib3/1.26.7\r\n\r\n".format(
    host=hostname,
    path=path
).encode("utf-8")

sock = socket.create_connection((hostname, 443), timeout=3.1)
context = ssl.create_default_context()

with sock:
    with context.wrap_socket(sock, server_hostname=hostname) as ssock:
        ssock.sendall(http_msg)
        response = http.client.HTTPResponse(ssock, method="GET")
        response.begin()
        print(response.status, response.reason)
        data = response.read()

resp_data = json.loads(data.decode("utf-8"))

What was perplexing is that the HTTP message I used was exactly the same as the one sent by urllib3, as shown by debugging the following code. (See this answer for how to set up logging to debug requests, which also works for urllib3.)

Yet, this code gave a 403 HTTP status code.

import urllib3

# Named "pool" rather than "http" so it does not shadow the http module used elsewhere.
pool = urllib3.PoolManager()

r = pool.request(
    "GET",
    "https://www.soraredata.com/api/stats/newFullRankings/all/false/all/7/0/sr_football",
)
assert r.status == 403

Moreover, http.client also gave a 403 status code, even though it seems to be doing pretty much what I did above: wrap a socket in an SSL context and send the request.

conn = http.client.HTTPSConnection(hostname)
conn.request("GET", path)
res = conn.getresponse()
assert res.status == 403
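
For anyone who wants to reproduce the comparison, the debug logging mentioned above can be switched on roughly as follows. This is a sketch using only the standard library; the function name enable_http_debug is made up for illustration, and the exact output will vary with the Python and urllib3 versions in use.

```python
import http.client
import logging


def enable_http_debug():
    """Turn on debug output for http.client and for the urllib3 logger.

    Hypothetical helper name; the calls inside are standard library calls.
    """
    # With debuglevel > 0, http.client prints the request line and headers
    # it sends, plus the reply status and headers, to stdout.
    http.client.HTTPConnection.debuglevel = 1
    # Route log records to the console at DEBUG level.
    logging.basicConfig(level=logging.DEBUG)
    # urllib3 logs its connection activity under the "urllib3" logger name.
    logging.getLogger("urllib3").setLevel(logging.DEBUG)
```

Calling enable_http_debug() before the urllib3 or http.client snippets prints the message that is actually sent, which can then be compared with the hand-written one.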

score 1 · Answer 2 · answered Jul 16 '22 at 05:50

Thank you ogdenkev!

I also found this, but it doesn't always work:

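import json

import cloudscraper

url = "https://www.soraredata.com/api/stats/newFullRankings/all/false/all/7/0/sr_football"

# create_scraper() returns a requests-style session that tries to get past Cloudflare's checks
scraper = cloudscraper.create_scraper()
r = scraper.get(url).text
y = json.loads(r)
print(y)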
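
Since this does not work every time, one option is to retry a few times before giving up. The following is only a sketch under that assumption: fetch_json_with_retries is a hypothetical helper, it takes any zero-argument callable that returns the response text, and the attempt count and delay are arbitrary defaults. Nothing here guarantees that a later attempt will succeed.

```python
import json
import time


def fetch_json_with_retries(fetch, attempts=3, delay=1.0):
    """Call fetch() until its text parses as JSON, at most `attempts` times.

    Hypothetical helper; `fetch` is any zero-argument callable returning text.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return json.loads(fetch())
        except Exception as exc:
            # Deliberately broad: covers both network errors raised by fetch()
            # and an HTML challenge page that fails to parse as JSON.
            last_error = exc
            if attempt < attempts - 1:
                time.sleep(delay)
    raise RuntimeError(f"no JSON after {attempts} attempts") from last_error
```

With the scraper above it could be used as fetch_json_with_retries(lambda: scraper.get(url).text).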

Maxime Lara

83
1
7

Interesting. Thanks for sharing. It makes sense that Cloudflare was denying the request based on something that requests or urllib3 were doing. I just wasn't able to figure out what that was. – ogdenkev Jul 16 '22 at 12:55
