I would like to write a scraper that has 3 different "groups" of attributes (or data) that would [and likely should] be kept separately.
I was hoping to use DataClasses and aim at Pythonic practices, but DataClasses don't feel appropriate for reasons stated in more detail later.
The 3 groups [or "interfaces"] are as follows:
#1: HTTP Header fields
- has defaults, but needs to be mutable at/after object instantiation of the (#3) request class object
- ideally acts like a dict when using a request method inside #3 request object
#2: API parameters for the URL query request
- has defaults, but also needs to be mutable at/after instantiation
- ideally acts like a dict when using a request method inside #3 request object
#3: Response Object (the data) after the request is returned to the user from the API server.
- I would later implement methods for the object to have output formats such as CSV, JSON, SQL DB, S3, etc. That would be [at least] a 4th interface.
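To make the output-format idea concrete, here is a minimal sketch of a response wrapper, assuming the data can be reduced to column names plus rows. The class name ResultTable, its methods, and the sample column names are hypothetical and only for illustration; SQL and S3 writers would be further methods along the same lines.

```python
import csv
import io
import json


class ResultTable:
    """Hypothetical wrapper for tabular response data (column names + rows)."""

    def __init__(self, columns, rows):
        self.columns = list(columns)
        self.rows = [list(row) for row in rows]

    def to_records(self):
        # One dict per row, keyed by column name.
        return [dict(zip(self.columns, row)) for row in self.rows]

    def to_json(self):
        return json.dumps(self.to_records())

    def to_csv(self, file):
        # `file` is any writable text file object (an open file or StringIO).
        writer = csv.writer(file)
        writer.writerow(self.columns)
        writer.writerows(self.rows)


# Made-up sample data, not real API output.
table = ResultTable(["PERSON_ID", "DISPLAY_NAME"], [[1, "Player A"], [2, "Player B"]])
buffer = io.StringIO()
table.to_csv(buffer)
```

Keeping the writers on a separate object means the request class never has to carry output methods before any data exists.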
The Task I've been trying to Accomplish
I want an interface where a user can instantiate a class, e.g. Players, with the API params they need and also update the HTTP header as needed.
The HTTP Header and the API Params are both easily stored as Python dicts (or JSON). I have included them below.
=> The question is how do I make them mutable in the Request object (Class) at instantiation (creation) and able to be updated after instantiation (creation)?
Inheritance via dataclasses? I have tried to put these dictionaries in dataclasses, but dataclasses don't accept a mutable default such as a dict directly. Getting around that means using default_factory via field from the dataclasses module, which feels like a hack. It's possible, but it defeats the point of using dataclasses to avoid all the extra syntax. Using dataclasses also means MyDataClass.__dict__ has way more stuff in it than PythonClass.__dict__.
=> Thus use a regular Python class or dict...
Using a regular Python class: there seem to be two options to allow mutability of the HTTP Header at creation.
1) Inheritance, but that muddies the waters by mixing the attributes of the HTTP Header with the API Params.
2) Composition, setting an attribute field to the HTTPClassHeader and doing some work to convert it back to a dict for use in the request_data() method.
Putting the dicts directly into the Players (request) class doesn't allow mutability via a nice keyword interface (or I'm not aware of how to implement it).
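For comparison, here is a minimal sketch of what the dataclass route with composition could look like, assuming the params and the header are kept in separate objects. The names PlayersParams, PlayersRequest and DEFAULT_HEADER are hypothetical, and the header dict is shortened for illustration.

```python
from dataclasses import asdict, dataclass, field

# Shortened stand-in for the full header dict; the name is an assumption.
DEFAULT_HEADER = {
    "Accept": "application/json, text/plain, */*",
    "Referer": "https://www.nba.com/",
}


@dataclass
class PlayersParams:
    """API params only (#2); asdict() returns just these three fields."""
    IsOnlyCurrentSeason: int = 0
    LeagueID: str = "00"
    Season: str = "2021-22"


@dataclass
class PlayersRequest:
    """Composes the params (#2) with a per-instance copy of the header (#1)."""
    params: PlayersParams = field(default_factory=PlayersParams)
    # default_factory builds a fresh dict per instance, so edits don't leak
    # into DEFAULT_HEADER or into other instances.
    header: dict = field(default_factory=lambda: dict(DEFAULT_HEADER))


req = PlayersRequest(params=PlayersParams(Season="1999-00"))
req.header["Referer"] = "https://www.another-website.com/"

params_dict = asdict(req.params)  # plain dict, usable as query params
```

The extra syntax is limited to the two field(default_factory=...) lines, and asdict(req.params) sidesteps the problem of unrelated attributes ending up in the params dict.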
Here's my code in text form:
import requests

BASE_URL = "https://stats.nba.com/stats"

class Players:
    __endpoint__ = "CommonallPlayers"

    # HTTPHeader is a placeholder for the (#1) header defaults (class or dict, still undecided)
    def __init__(self, IsOnlyCurrentSeason=0, LeagueID="00", Season="2021-22", header=HTTPHeader) -> None:
        # these first 3 attributes constitute the (#2) API Params
        self.IsOnlyCurrentSeason = IsOnlyCurrentSeason
        self.LeagueID = LeagueID
        self.Season = Season
        self.header = header  # (1) inherit as a Class or Dict?

    def encode_api_params(self):
        return self.__dict__  # if only 3 attributes, this works, but not if I add more attributes HTTP or self.request_data

    def get_http_header(self):
        # ideally can return the http_header as a dict
        pass

    # ideally this is NOT instantiated (as doesn't have data, shouldn't be accessible to user until AFTER request)
    def request_data(self):
        url_api = f"{BASE_URL}/{self.__endpoint__}"
        return requests.get(url_api,
                            params=self.encode_api_params(),
                            headers=self.get_http_header())
# works, has current defaults (current season)
c = Players()
# a common use case, using a different Season than the default (current season)
c = Players(Season="1999-00")
# A possible needed change, with 2 possible desired interfaces
c = Players(Season="1999-00", header={"Referer": "https://www.another-website.com/"})
c = Players(Season="1999-00").header(Referer="https://www.another-website.com/")
# Final outputs
c.request_data().to_csv("downloads/my_data.csv")
c.request_data().to_sql("table-name")
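One way the two desired interfaces could be supported with a regular class is sketched below, assuming composition with two plain dicts. The attribute name params, the method name update_header, and DEFAULT_HEADER are hypothetical. A method literally named header would collide with the header attribute, so the sketch uses a different name.

```python
# Shortened stand-in for the full HTTP_HEADER dict; the name is an assumption.
DEFAULT_HEADER = {
    "Accept": "application/json, text/plain, */*",
    "Referer": "https://www.nba.com/",
}


class Players:
    endpoint = "commonallplayers"

    def __init__(self, IsOnlyCurrentSeason=0, LeagueID="00", Season="2021-22", header=None):
        # (#2) API params live in their own dict, separate from everything else.
        self.params = {
            "IsOnlyCurrentSeason": IsOnlyCurrentSeason,
            "LeagueID": LeagueID,
            "Season": Season,
        }
        # (#1) Copy the defaults, then apply any overrides passed at creation.
        self.header = {**DEFAULT_HEADER, **(header or {})}

    def update_header(self, mapping=None, **fields):
        """Update header fields after creation; returns self to allow chaining."""
        # A mapping is needed for names like "Accept-Language" that are not
        # valid Python keyword arguments.
        self.header.update(mapping or {}, **fields)
        return self

    def encode_api_params(self):
        return dict(self.params)

    def get_http_header(self):
        return dict(self.header)


c = Players(Season="1999-00", header={"Referer": "https://www.another-website.com/"})
d = Players(Season="1999-00").update_header(Referer="https://www.another-website.com/")
```

Both dicts can then be handed to requests.get as params and headers, and adding further attributes to the class no longer changes what encode_api_params() returns.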
Here are the HTTP header, the API params, and the request in their simplest form (running these together would return some data):
import requests

HTTP_HEADER = {
    "Accept": "application/json, text/plain, */*",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "en-US,en;q=0.9",
    "Connection": "keep-alive",
    "Host": "stats.nba.com",
    "Origin": "https://www.nba.com",
    "Referer": "https://www.nba.com/",
    "Sec-Fetch-Dest": "empty",
    "Sec-Fetch-Mode": "cors",
    "Sec-Fetch-Site": "same-site",
    "Sec-GPC": "1",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36",
    "x-nba-stats-origin": "stats",
    "x-nba-stats-token": "true",
}
params = {'IsOnlyCurrentSeason': 0, 'LeagueID': '00', 'Season': '2021-22'}
r = requests.get(
    "https://stats.nba.com/stats/commonallplayers",  # base url + endpoint
    params=params,  # expects (#2) params, the api parameters to be a dict
    headers=HTTP_HEADER,  # expects (#1) headers to be a dict
)
r.json()