I'm trying to build a small program that uses the censusgeocode package to interact with the US Census Bureau API's address batch facility. The API has a limit of 10,000 addresses per call, but my dataframe has approximately 3 million rows. I therefore want to split the dataframe into N parts of roughly 10,000 rows each, feed each part into the API call, extract the output, and append it all together.
I found this Stack Overflow post, which has been quite helpful in giving me a function to split my df. It doesn't return dataframes though (e.g. they don't show up if I run %who_ls DataFrame), and I don't know how to call the outputs individually in order to feed them into an API call.
This is the function I'm using to split the dataframe:
import math

def split_dataframe(df, chunk_size=10000):
    # Collect consecutive row slices of at most chunk_size rows each.
    chunks = list()
    num_chunks = math.ceil(len(df) / chunk_size)
    for i in range(num_chunks):
        chunks.append(df[i*chunk_size:(i+1)*chunk_size])
    return chunks
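For context, here is a minimal sketch of what the function hands back. It returns an ordinary Python list whose elements are DataFrames, which is probably why %who_ls DataFrame shows nothing: the only new top-level variable is a list. The toy data and the small chunk size are assumptions for illustration, and the function is repeated in condensed form so the snippet runs on its own.

```python
import math

import pandas as pd


def split_dataframe(df, chunk_size=10000):
    """Split df into a list of DataFrames of at most chunk_size rows each."""
    num_chunks = math.ceil(len(df) / chunk_size)
    return [df.iloc[i * chunk_size:(i + 1) * chunk_size] for i in range(num_chunks)]


# Hypothetical toy data standing in for the real 3-million-row address table.
df = pd.DataFrame({"id": range(25), "street": [f"{n} Main St" for n in range(25)]})

# Bind the returned list to a name; without this the result is discarded.
chunks = split_dataframe(df, chunk_size=10)

print(type(chunks))   # the container is a plain list
print(len(chunks))    # number of chunks produced

first = chunks[0]     # individual chunks are reached by list index
last = chunks[-1]
print(len(first), len(last))
```

Each element such as chunks[0] can be used like any other DataFrame, for example written out with to_csv or passed to another function.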
How do I refer to the chunks that are returned from that function? And is the best way to proceed simply to loop over them and feed them into the API call? I.e. something like:
for i in chunks:
    censusgeocode --csv batch_i.csv
Or is there a smarter/more efficient way to do this?
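One way the loop could look is sketched below, under stated assumptions. The name geocode_batch is a hypothetical callable that takes the CSV text for one batch and returns a list of dicts; in practice it might wrap whatever batch call the censusgeocode package exposes. Writing the CSV without a header row is also an assumption about the service's input format. The fake_geocode_batch stub only echoes rows back so the example runs without network access.

```python
import io

import pandas as pd


def geocode_in_batches(df, geocode_batch, chunk_size=10000):
    """Run geocode_batch on successive slices of df and stack the results.

    geocode_batch is a hypothetical callable: it receives the CSV text for
    one batch and returns a list of dicts, one per address.
    """
    results = []
    for start in range(0, len(df), chunk_size):
        chunk = df.iloc[start:start + chunk_size]
        # Assumption: the batch endpoint wants headerless CSV rows.
        csv_text = chunk.to_csv(index=False, header=False)
        results.append(pd.DataFrame(geocode_batch(csv_text)))
    # Concatenate once at the end rather than appending inside the loop.
    return pd.concat(results, ignore_index=True)


def fake_geocode_batch(csv_text):
    """Stand-in for the real API call: echoes each id with a dummy match flag."""
    rows = pd.read_csv(io.StringIO(csv_text), header=None, names=["id", "street"])
    return [{"id": r.id, "match": True} for r in rows.itertuples()]


# Hypothetical toy data; the real table would have about 3 million rows.
addresses = pd.DataFrame({"id": range(25), "street": [f"{n} Main St" for n in range(25)]})
geocoded = geocode_in_batches(addresses, fake_geocode_batch, chunk_size=10)
print(len(geocoded))
```

Collecting the per-batch results in a list and calling pd.concat once avoids repeatedly copying a growing DataFrame. With roughly 300 batches, the API round trips are likely to dominate the runtime, so retrying failed batches or saving each batch result to disk as it arrives may matter more than the splitting itself.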
Any pointers folks can give would be appreciated!