I'm trying to use the Socrata API to download this dataset on 311 calls in NYC since 2010. The dataset is 22 million rows. I've never worked with APIs before and am unsure of the best way to download it. I've written a snippet of code below to grab the data in chunks of 2,000 rows, but since each chunk takes about a minute, that works out to roughly 11,000 minutes for the full dataset (22 million rows / 2,000 rows per chunk).
from sodapy import Socrata
import pandas as pd

data_url = 'data.cityofnewyork.us'
dataset = 'erm2-nwe9'
app_token = app_token  # my Socrata app token
client = Socrata(data_url, app_token)
client.timeout = 6000

# total row count, used to know when to stop paging
record_count = client.get(dataset, select='COUNT(*)')

start = 0
chunk_size = 2000
results = []
while True:
    print(start)
    # fetch the next page of rows (offset keeps each request from returning the same rows)
    results.extend(client.get(dataset, limit=chunk_size, offset=start))
    start = start + chunk_size
    if start > int(record_count[0]['COUNT']):
        break

df = pd.DataFrame(results)
I don't need all of the data; the records from 2015 onward would be enough, but I'm not sure how to express that in the request. The 'date' column is formatted like '2020-04-27T01:59:21.000'. If getting the entire dataset is unreasonable, how can I restrict the request to post-2015 records? More generally, is 22 million records considered too large a request, and is it better practice to break the request into chunks or to pull everything in one go with get_all()?
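For the date filter, it looks like get() accepts a where argument that takes a SoQL expression, so the sketch below (reusing client, dataset, chunk_size, and start from my code above) is roughly what I had in mind. I'm guessing at the column name created_date, and I'm not sure whether comparing the timestamp against a string literal like this is valid:

# rough sketch of a filtered request; 'created_date' is my guess at the
# column name, and I'm assuming SoQL can compare it to a string literal
where_clause = "created_date >= '2015-01-01T00:00:00'"
chunk = client.get(dataset,
                   where=where_clause,
                   limit=chunk_size,
                   offset=start)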