python - Separate thread to write data in parallel makes my code slower - but why?

Question

My code flow is something like:

import pandas as pd
import threading
import helpers

for file in files:
    df_full = pd.read_csv(file, chunksize=500000)
    for df in df_full:
        df_ready = prepare_df(df)
        # testing if the previous instance is running
        if isinstance(upload_thread, threading.Thread):
            if upload_thread.isAlive():
                print('waiting for the last upload op to finish')
                upload_thread.join()

        # starts the upload in another thread, so the loop can continue on the next chunk
        upload_thread = threading.Thread(target=helpers.uploading, kwargs=kwargs)
        upload_thread.start()

It works, the problem is: running it with threading makes it slower!

My idea of code flow is:

process a chunk of data
after its done, upload it on the background
while uploading, advance the loop to the next step, that is processing the next chunk of data

In theory, sounds great, but after a lot of trials and timing, I believe the threading is slowing down the code flow.

I'm sure I messed something up, please help me to find out what it is.

Also, this function 'helpers.uploading' returns important results to me. How can I access those results? Ideally I need to append the result of each iteration to a list of results. Without threading, this would be something like:

import pandas as pd
import helpers

results = []

for file in files:
    df_full = pd.read_csv(file, chunksize=500000)
    for df in df_full:
        df_ready = prepare_df(df)
        result = helpers.uploading(**kwargs)
        results.append(result)

Thanks!

Python's [GIL](https://wiki.python.org/moin/GlobalInterpreterLock) only allows one thread to execute Python bytecode at a time. When you call `upload_thread.join()` in your parent thread you block your parent thread from going on and it will wait until the upload_thread returns. For returning values from threads you can use a `queue.Queue` like shown [here](https://stackoverflow.com/a/53432672/9059420). — Darkonaut, Feb 09 '19 at 19:09
Thanks Darkonaut! I'll give it a try and soon I'll bring news on this. — erickfis, Feb 12 '19 at 18:08

python - Separate thread to write data in parallel makes my code slower - but why?

0 Answers0