
I'm trying to use the Socrata API to download this dataset on 311 calls in NYC since 2010. The dataset is 22 million rows. I've never worked with APIs before and am unsure of the best way to download it. I've written a snippet of code below to grab the data in chunks of 2,000 rows, but by my calculations that would take about 11,000 minutes, since each chunk of 2,000 takes one minute.

import pandas as pd
from sodapy import Socrata

data_url = 'data.cityofnewyork.us'
dataset = 'erm2-nwe9'
client = Socrata(data_url, app_token)  # app_token defined elsewhere
client.timeout = 6000

# Total row count, used to know when to stop paging
record_count = client.get(dataset, select='COUNT(*)')

start = 0
chunk_size = 2000
results = []
while True:
    print(start)
    # offset=start is required; without it every iteration re-fetches the same first chunk
    results.extend(client.get(dataset, limit=chunk_size, offset=start))
    start = start + chunk_size
    if start > int(record_count[0]['COUNT']):
        break

df = pd.DataFrame(results)

I don't need all of the data; I could just use the data from 2015 onward, though I'm not sure how to indicate this in the request. The 'date' column is formatted like '2020-04-27T01:59:21.000'. How can I get just the post-2015 data, if getting the entire dataset is unreasonable? In general, is 22 million records considered too large a request? And is it best practice to break the request into chunks, or to try to get the dataset in one go using get_all()?

Christine Jiang
  • Can you download the CSV and just use that? It looks like if you click export you can access it in different formats here: https://data.cityofnewyork.us/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9 – seanpue Apr 28 '20 at 23:59
  • @seanpue I'm trying to, but it's taking a really long time... I also want to learn how to work with an API / threading, so I'm also using this as an exercise in doing that. – Christine Jiang Apr 29 '20 at 00:08
  • _I don't need all of the data; I could just use the data from 2015, though I'm not sure how to indicate this in the request._ That information should (hopefully) be in the documentation for the API. – AMC Apr 29 '20 at 01:58
  • @ChristineJiang Oh okay. Have you looked at `multiprocessing`? – seanpue Apr 29 '20 at 13:45
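Following up on the threading/multiprocessing idea from the comments, here is a minimal sketch of fetching pages concurrently with a thread pool. It assumes the client, dataset, chunk_size, and record_count variables from the question's code, and that your app token's rate limit tolerates a few parallel requests; fetch_page is a hypothetical helper, not part of sodapy.

from concurrent.futures import ThreadPoolExecutor

def fetch_page(offset):
    # Hypothetical helper: pull one page of chunk_size rows starting at offset
    return client.get(dataset, limit=chunk_size, offset=offset)

offsets = range(0, int(record_count[0]['COUNT']), chunk_size)
results = []
with ThreadPoolExecutor(max_workers=4) as pool:
    # map() yields pages in submission order, so row order is preserved
    for page in pool.map(fetch_page, offsets):
        results.extend(page)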

1 Answer


There are a few options. First, you may want to look at the sodapy library, which makes downloading from Socrata data portals easier. It has some of the backend handling for larger datasets built in, namely paging.
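As a minimal sketch, recent versions of sodapy expose get_all(), which returns a generator and handles the limit/offset paging for you (the portal URL and dataset ID are taken from the question; app_token is assumed to be defined):

import pandas as pd
from sodapy import Socrata

client = Socrata('data.cityofnewyork.us', app_token)
client.timeout = 6000

# get_all() yields one record (a dict) at a time, paging behind the scenes
records = client.get_all('erm2-nwe9')
df = pd.DataFrame(records)  # materializing all 22M rows is still slow and memory-heavy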

Second, you can leverage the API's SoQL query parameters to filter the data server-side, including by date. There are a lot of examples, including this answer, that will help you get started with a query. You can combine this with sodapy as well.
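For instance, a sketch of the date filter: SoQL supports a where clause on timestamp columns. This assumes the dataset's date field is named created_date (the question calls it the 'date' column; verify the exact name in the dataset's API docs):

# Server-side filter: only rows created on or after Jan 1, 2015
# 'created_date' is an assumption; check the column name in the API docs
results = client.get_all('erm2-nwe9', where="created_date >= '2015-01-01T00:00:00'")
df = pd.DataFrame(results)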

Tom Schenk Jr