
Pandas read_parquet() called on a Google Cloud Storage path hangs when using Python multiprocessing (even when limited to a single process).

from concurrent.futures import ProcessPoolExecutor

import pandas as pd

# GS_PATH is the gs:// path to the Parquet file (defined elsewhere)

def read_chunk(*args):
    # Hangs here when executed in a worker process
    df = pd.read_parquet(GS_PATH)
    # Do stuff

num_files = 1000
with ProcessPoolExecutor(max_workers=2) as executor:
    futures = {}
    for i in range(num_files):
        future = executor.submit(read_chunk)

It does not hang when reading a local file, or when using multithreading with the same Google Cloud Storage path.
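For comparison, this is roughly the threaded variant that works (a minimal sketch reusing the same read_chunk and GS_PATH as above; only the executor changes):

from concurrent.futures import ThreadPoolExecutor

num_files = 1000
with ThreadPoolExecutor(max_workers=2) as executor:
    futures = {}
    for i in range(num_files):
        # Same task, but submitted to threads instead of processes
        future = executor.submit(read_chunk)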

It also does not hang when using the following code:

from concurrent.futures import ProcessPoolExecutor
from io import BytesIO

from google.cloud import storage

def read_chunk(*args):
    storage_client = storage.Client()
    # Create a bucket object for our bucket
    bucket = storage_client.get_bucket("bucket")
    # Create a blob object from the filepath
    blob = bucket.blob(GS_PATH)
    # Download the file into an in-memory byte stream
    byte_stream = BytesIO()
    blob.download_to_file(byte_stream)
    # Do stuff

num_files = 1000
with ProcessPoolExecutor(max_workers=2) as executor:
    futures = {}
    for i in range(num_files):
        future = executor.submit(read_chunk)

Can someone help me understand what might be causing this?

As a workaround I wrote the second chunk of code above, which works but is much slower, so I would like to find a way to make the original code work.
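One direction I have not verified: if the hang is related to forking the parent process after some fork-unsafe state (for example an authentication session) already exists, forcing the "spawn" start method might behave differently. A minimal sketch of that variant (an untested assumption, not a confirmed fix):

import multiprocessing
from concurrent.futures import ProcessPoolExecutor

# "spawn" starts each worker with a fresh interpreter instead of forking
# the parent's state; this is only the variant I would compare against.
ctx = multiprocessing.get_context("spawn")

num_files = 1000
with ProcessPoolExecutor(max_workers=2, mp_context=ctx) as executor:
    futures = {}
    for i in range(num_files):
        future = executor.submit(read_chunk)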

Edit: emphasizing that my issue isn't that I can't read from GCS, but that I can't do so in the context of multiprocessing. This is therefore not a duplicate. I have thoroughly researched this issue and unfortunately found nothing similar.

  • Does this answer your question? [How to create pandas dataframe from parquet files kept on google storage](https://stackoverflow.com/questions/60394889/how-to-create-pandas-dataframe-from-parquet-files-kept-on-google-storage) – Robert G Jun 09 '23 at 21:01
  • Thanks, my issue is not that I can't read from GCS but that I can't in the context of multiprocessing – John Jun 10 '23 at 01:42

0 Answers