
I would like to be able to carry out a task like the one shown in the image: a function that

  1. Reads a dataset
  2. Applies some transformations
  3. Performs in parallel an export of the dataset to csv and while the csv is saved returns the dataset as pd.DataFrame


2 Answers

You can do this:

import threading

def thread_function(df, path):
    # Runs in the background thread: write the DataFrame to disk.
    # Note: to_csv() with no path returns a string instead of writing a file,
    # so a file path must be passed in.
    df.to_csv(path)

def blue_function(df, path):
    # Start the export in a separate thread and hand the DataFrame
    # back to the caller without waiting for the write to finish.
    thread = threading.Thread(target=thread_function, args=(df, path))
    thread.start()
    return df

According to the documentation, calling thread.join() to wait for the thread to finish is not necessary, as:

the main thread is not a daemon thread and therefore all threads created in the main thread default to daemon = False. The entire Python program exits when no alive non-daemon threads are left.
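Putting the pieces together, here is a minimal, self-contained sketch of the pattern from the question. The transformation (doubling every column), the helper names, and the output file name are placeholders, not part of the original question:

```python
import threading
import pandas as pd

def _write_csv(df, path):
    # Runs in the background thread: the caller does not wait for this.
    df.to_csv(path, index=False)

def load_transform_export(df, path):
    # 2. Apply some transformation (placeholder: double every column).
    df = df * 2
    # 3a. Kick off the CSV export in a background thread...
    writer = threading.Thread(target=_write_csv, args=(df, path))
    writer.start()
    # 3b. ...and return the DataFrame immediately, while the file is written.
    return df, writer

df = pd.DataFrame({"a": [1, 2, 3]})
result, writer = load_transform_export(df, "blue_out.csv")
print(result["a"].tolist())  # the caller can use the data right away
writer.join()  # optional: only needed if you must know the file is on disk
```

Returning the `Thread` object alongside the DataFrame is one way to let the caller decide whether to wait for the export; dropping it works too, since a non-daemon thread keeps the program alive until the write finishes.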

EDIT:

Doing this, you spawn a new thread for the export, which lets the OS schedule the two procedures independently. The advantage is that one does not have to wait for the other to finish. This makes your code concurrent (asynchronous) rather than truly parallel.

In other programming languages, spawning threads also allows the OS to schedule them on different CPU cores so they run in parallel. In CPython this is not the case because of the GIL (Global Interpreter Lock), which allows only one thread to execute Python bytecode at a time.
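That said, CPython releases the GIL during blocking I/O, so an I/O-bound task like a file write can still overlap with other threads. A minimal sketch, using `time.sleep` as a stand-in for a blocking write:

```python
import threading
import time

def fake_io():
    # time.sleep blocks without holding the GIL, like a blocking file write.
    time.sleep(0.5)

start = time.time()
threads = [threading.Thread(target=fake_io) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.time() - start
# Four 0.5 s "writes" overlap, so the total is close to 0.5 s, not 2 s.
print(f"elapsed: {elapsed:.2f}s")
```

This is why the threading approach above is a reasonable fit for the question: the export is dominated by disk I/O, not by Python bytecode.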

You could start a new process instead of a thread, but for this task it would mostly add overhead (time and memory, since the DataFrame must be copied to the child process) without bringing any real advantage. Chrome spawns processes instead of threads for browser tabs, but it does so for security reasons, because processes do not share heap memory.

If you truly require CPU-bound parallelism within one process, your only real option is to implement that part in C, releasing the GIL, and call it from Python.

Šimon Kocúrek

If I understand your question correctly this code is for you:

import pandas as pd
from multiprocessing import Lock, Process
from time import time

def writefile(df, lock, filename):
    # Hold the lock while writing so concurrent processes
    # cannot interleave their appends to the same file.
    lock.acquire()
    df.to_csv(filename, index=False, mode='a', header=False)
    lock.release()


if __name__ == '__main__':
    N = 10000000

    df = pd.DataFrame({'a': range(1, N), 'b': range(1, N), 'c': range(1, N)})
    filename = "tmp.csv"

    start = time()
    df.to_csv(filename, index=False, mode='a', header=False)
    print("Standard execution time:", time() - start, 'seconds')

    start = time()
    lock = Lock()
    p = Process(target=writefile, args=(df, lock, filename))
    p.start()
    p.join()
    print("Multiprocessing execution time:", time() - start, 'seconds')

The multiprocessing version takes more time than the plain call, because the DataFrame must be copied to the child process. The Process and Lock combination is there to synchronize the writes when several processes append to the same file.

Massifox