My code flow is something like:

    import pandas as pd
    import threading
    import helpers

    upload_thread = None  # no upload has been started yet

    for file in files:
        df_full = pd.read_csv(file, chunksize=500000)
        for df in df_full:
            df_ready = prepare_df(df)
            # wait for the previous upload, if one is still running
            if isinstance(upload_thread, threading.Thread):
                if upload_thread.is_alive():
                    print('waiting for the last upload op to finish')
                    upload_thread.join()
            # starts the upload in another thread, so the loop can continue on the next chunk
            upload_thread = threading.Thread(target=helpers.uploading, kwargs=kwargs)
            upload_thread.start()

It works, but the problem is that running it with threading makes it slower!
My idea of the code flow is:
1. process a chunk of data
2. after it's done, upload it in the background
3. while uploading, advance the loop to the next step, that is, processing the next chunk of data
In theory this sounds great, but after a lot of trials and timing, I believe the threading is slowing the code down.
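One way to narrow this down is to time the two steps separately. The sketch below is only an illustration: `timed`, `fake_prepare`, and `fake_upload` are hypothetical stand-ins, not the real `prepare_df` or `helpers.uploading`. If the upload is mostly waiting on the network, a thread can overlap it with the next chunk's preparation. If both steps are CPU-heavy Python code, CPython's global interpreter lock keeps them from running in parallel, which could explain a slowdown.

```python
import time

def timed(func, *args, **kwargs):
    """Run func and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start

# Hypothetical stand-ins for prepare_df and helpers.uploading.
def fake_prepare(chunk):
    return [x * 2 for x in chunk]  # CPU work done in Python

def fake_upload(rows):
    time.sleep(0.05)  # simulates waiting on the network
    return len(rows)

chunk = list(range(1000))
ready, prepare_seconds = timed(fake_prepare, chunk)
uploaded, upload_seconds = timed(fake_upload, ready)
print(f"prepare: {prepare_seconds:.4f}s, upload: {upload_seconds:.4f}s")
```

Running the same measurement on the real functions, per chunk, would show which step dominates.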
I'm sure I messed something up, please help me to find out what it is.
Also, the function 'helpers.uploading' returns results that I need. How can I access them? Ideally, I would append the result of each iteration to a list of results. Without threading, this would be something like:

    import pandas as pd
    import helpers

    results = []
    for file in files:
        df_full = pd.read_csv(file, chunksize=500000)
        for df in df_full:
            df_ready = prepare_df(df)
            result = helpers.uploading(**kwargs)
            results.append(result)

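For illustration, here is a minimal sketch of the flow I have in mind, with return values collected. It swaps `threading.Thread` for `concurrent.futures.ThreadPoolExecutor` from the standard library, because a `Future` carries the function's return value. `prepare_chunk`, `upload`, and `process_all` are hypothetical stand-ins for `prepare_df`, `helpers.uploading`, and the loop above, so this is an assumption about the shape of the code, not the real thing.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for prepare_df and helpers.uploading.
def prepare_chunk(chunk):
    return [x * 2 for x in chunk]

def upload(rows):
    time.sleep(0.01)  # simulates waiting on the network
    return sum(rows)

def process_all(chunks):
    results = []
    pending = None  # Future of the upload currently in flight, if any
    # A single worker means at most one upload runs at a time.
    with ThreadPoolExecutor(max_workers=1) as executor:
        for chunk in chunks:
            # Prepare the next chunk while the previous upload is running.
            ready = prepare_chunk(chunk)
            if pending is not None:
                # Blocks until the previous upload finishes and returns its
                # value; an exception raised in upload() is re-raised here.
                results.append(pending.result())
            pending = executor.submit(upload, ready)
        if pending is not None:
            results.append(pending.result())  # collect the final upload
    return results

results = process_all([[1, 2], [3, 4], [5]])
print(results)
```

Results are appended in chunk order because each upload is collected before the next one is submitted.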
Thanks!