Pandas read_parquet() called on a Google Cloud Storage path hangs when using Python multiprocessing (even when limited to 1 process).
import pandas as pd
from concurrent.futures import ProcessPoolExecutor

def read_chunk(*args):
    df = pd.read_parquet(GS_PATH)  # hangs here when run in a worker process
    # Do stuff

num_files = 1000
with ProcessPoolExecutor(max_workers=2) as executor:
    futures = {}
    while num_files > 0:
        for i in range(num_files):
            future = executor.submit(read_chunk, *args)
It does not hang when reading a local file, or when using multithreading with the same Google Cloud Storage path.
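For reference, the multithreaded variant that does not hang is essentially the same submission loop with ThreadPoolExecutor swapped in (a sketch, reusing read_chunk from above):

from concurrent.futures import ThreadPoolExecutor

# Same read_chunk and submission loop as above, but with threads instead of processes
num_files = 1000
with ThreadPoolExecutor(max_workers=2) as executor:
    for i in range(num_files):
        future = executor.submit(read_chunk, *args)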
It also does not hang when using the following code:
from io import BytesIO
from concurrent.futures import ProcessPoolExecutor
from google.cloud import storage

def read_chunk(*args):
    storage_client = storage.Client()
    # Create a bucket object for our bucket
    bucket = storage_client.get_bucket("bucket")
    # Create a blob object from the filepath
    blob = bucket.blob(GS_PATH)
    # Download the file into an in-memory buffer
    byte_stream = BytesIO()
    blob.download_to_file(byte_stream)
    # Do stuff

num_files = 1000
with ProcessPoolExecutor(max_workers=2) as executor:
    futures = {}
    while num_files > 0:
        for i in range(num_files):
            future = executor.submit(read_chunk, *args)
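For completeness, the "# Do stuff" step then parses the downloaded bytes. A minimal sketch of that step, assuming the blob is the same Parquet file and a Parquet engine (pyarrow or fastparquet) is installed:

# Rewind the in-memory buffer before handing it to pandas
byte_stream.seek(0)
df = pd.read_parquet(byte_stream)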
Can someone help me understand what might be causing this?
To work around this I wrote the second snippet above, which works but is much slower, so I would like to find a way to make the original code work.
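One variant that might be worth noting (an assumption on my part, not something I have verified to be related) is forcing the "spawn" start method for the workers, so that each process starts fresh instead of forking the parent's state; ProcessPoolExecutor accepts an mp_context argument for this since Python 3.7:

import multiprocessing
from concurrent.futures import ProcessPoolExecutor

# Start workers with "spawn" instead of the default fork on Linux
ctx = multiprocessing.get_context("spawn")
with ProcessPoolExecutor(max_workers=2, mp_context=ctx) as executor:
    for i in range(num_files):
        future = executor.submit(read_chunk, *args)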
Edit: to emphasize, my issue is not that I cannot read from GCS at all, but that I cannot read from it in the context of multiprocessing. This is therefore not a duplicate. I have researched this issue thoroughly and, unfortunately, found nothing similar.