I'm trying to combine every block file from the 2010 census into a single master block file for the US. I'm currently doing this in Google Colab, and even on their Pro subscription, which gives you about 25 GB of RAM, I'm maxing out all available memory on the 45th file (I only have 5 more to go!). Code-wise, I'm just building a list of dataframes that need to be concatenated together and ultimately written to disk:
import os
import pandas as pd
import geopandas as gpd

gdfs = []
census_blocks_basepath = r'/content/drive/My Drive/Census/blocks/'
census_block_filenames = [f for f in os.listdir(census_blocks_basepath) if f.endswith('.shp')]
for index, block_filename in enumerate(census_block_filenames):
    file_name = os.path.join(census_blocks_basepath, block_filename)
    gdfs.append(gpd.read_file(file_name))  # every state's block file is held in memory
    print('Appended file %s, %s' % (index, block_filename))

# concatenate everything into one GeoDataFrame, keeping the CRS of the first file
gdf = gpd.GeoDataFrame(pd.concat(gdfs, ignore_index=True), crs=gdfs[0].crs)
# gdf.reset_index(inplace=True, drop=True)
gdf.head(3)
Instead, I think I should (roughly sketched below):

1. load a single geodataframe
2. append it to a master file that exists on disk rather than in memory (the way csv.writer appends rows to a file)
3. delete the geodataframe loaded in step 1 (to avoid memory accrual)

and then repeat steps 1-3 for all geodataframes remaining in the source directory.
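Here is a minimal sketch of what I have in mind, written against my file layout above. The mode='a' keyword and the switch to a GeoPackage output are assumptions on my part; whether GeoDataFrame.to_file actually supports a disk-based append like this is exactly what I can't find documented.

```python
import os
import geopandas as gpd

census_blocks_basepath = r'/content/drive/My Drive/Census/blocks/'
# hypothetical master output; GeoPackage rather than shapefile, since a single shapefile tops out around 2 GB
master_path = r'/content/drive/My Drive/Census/blocks_master.gpkg'

census_block_filenames = [f for f in os.listdir(census_blocks_basepath) if f.endswith('.shp')]
for index, block_filename in enumerate(census_block_filenames):
    gdf = gpd.read_file(os.path.join(census_blocks_basepath, block_filename))  # step 1: load one file
    if index == 0:
        gdf.to_file(master_path, driver='GPKG')            # first file creates the master on disk
    else:
        gdf.to_file(master_path, driver='GPKG', mode='a')  # step 2: hypothetical disk-based append
    del gdf                                                # step 3: free memory before the next file
    print('Wrote file %s, %s' % (index, block_filename))
```

The one-file-at-a-time loop with del is the part I care about; the output format is secondary.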
I don't see documentation on whether geopandas supports disk-based appends; it only seems able to overwrite previous files via GeoDataFrame.to_file. That said, I see that geopandas has a GeoDataFrame.to_postgis method with a chunksize argument, which makes me think it's possible to append data onto a geofile on disk (or I'm wrong and that's just a feature of postgis).
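For reference, if PostGIS turns out to be the only appendable target, I assume the incremental version would look roughly like this (the connection string and table name are placeholders; I haven't actually stood up a PostGIS database for this):

```python
import os
import geopandas as gpd
from sqlalchemy import create_engine

# placeholder connection string; a real PostGIS database would be needed
engine = create_engine('postgresql://user:password@localhost:5432/census')

census_blocks_basepath = r'/content/drive/My Drive/Census/blocks/'
census_block_filenames = [f for f in os.listdir(census_blocks_basepath) if f.endswith('.shp')]
for index, block_filename in enumerate(census_block_filenames):
    gdf = gpd.read_file(os.path.join(census_blocks_basepath, block_filename))
    # if_exists='append' adds rows to the existing table; chunksize batches the inserts
    gdf.to_postgis('us_blocks', engine, if_exists='append', chunksize=10000)
    del gdf  # free memory before reading the next file
    print('Loaded file %s, %s' % (index, block_filename))
```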
Any ideas?