for og_raw_file in de_core.file.rglob(raw_path_object.url):
    with de_core.file.open(og_raw_file, mode="rb") as raw_file, de_core.file.open(
        staging_destination_path + de_core.aws.s3.S3FilePath(raw_file.name).file_name, "wb"
    ) as stager_file, concurrent.futures.ThreadPoolExecutor() as executor:
        logger.info("Submitting file to thread to add metadata", raw_file=raw_file)
        executor.submit(
            <long_length_metadata_function_that_I_want_to_parallize>,
            raw_path_object,
            <...rest of arguments to function>
        )
I want every file to be processed in its own thread, all at once, with submit() being non-blocking. What am I doing wrong? What actually happens is that each file is submitted one at a time, and the next file isn't submitted until the previous one finishes. How do I parallelize this properly?
I would expect the "Submitting file to thread to add metadata" log line to appear quickly for every file at the beginning, since the threads should be submitted and then forgotten, but that's not what's happening.
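As a sanity check on my mental model, this toy example (slow_task is just a stand-in for the long metadata step, not my real code) does behave the way I expect: all the submits happen immediately, long before any task finishes.

import concurrent.futures
import time

def slow_task(i):
    time.sleep(2)  # stand-in for the long-running metadata step
    return i

with concurrent.futures.ThreadPoolExecutor() as executor:
    # submit() returns a Future immediately; nothing here blocks
    futures = [executor.submit(slow_task, i) for i in range(5)]
    print("all five submitted immediately")  # prints before any task finishes
    for future in concurrent.futures.as_completed(futures):
        print("finished", future.result())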
Do I need to do something like this? Why?
future_mapping = {executor.submit(predicate, uri): uri for uri in uris}
for future in concurrent.futures.as_completed(future_mapping):
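Spelled out, the pattern I'm asking about would look something like this (predicate and uris here are placeholders standing in for my real function and inputs):

import concurrent.futures

def predicate(uri):  # placeholder for the real metadata function
    return uri

uris = ["s3://bucket/a.parquet", "s3://bucket/b.parquet"]  # placeholder inputs

with concurrent.futures.ThreadPoolExecutor() as executor:
    # submit everything up front; each submit() returns a Future immediately
    future_mapping = {executor.submit(predicate, uri): uri for uri in uris}
    # then consume results as they finish, in completion order
    for future in concurrent.futures.as_completed(future_mapping):
        uri = future_mapping[future]
        result = future.result()  # re-raises any exception from the worker thread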
The metadata function is basically adding columns to a Parquet file. Given the Python GIL, is this not something I can use threads for?
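And if the GIL does serialize the work because it's CPU-bound, is the fix just to swap in ProcessPoolExecutor, something like this sketch (add_metadata and paths are hypothetical stand-ins, not my real code)?

import concurrent.futures

def add_metadata(path):  # hypothetical stand-in for my real function
    return path

paths = ["a.parquet", "b.parquet"]  # placeholder inputs

if __name__ == "__main__":  # guard required for process pools on spawn-based platforms
    with concurrent.futures.ProcessPoolExecutor() as executor:
        # same submit-all-then-wait shape, but workers are separate processes,
        # so the function and its arguments must be picklable
        futures = [executor.submit(add_metadata, p) for p in paths]
        for future in concurrent.futures.as_completed(futures):
            future.result()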