
I would like to be able to carry out a task like the one shown in the image: a function that

  1. Reads a dataset
  2. Applies some transformations
  3. Performs in parallel an export of the dataset to csv and while the csv is saved returns the dataset as pd.DataFrame


2 Answers

You can do this:

import threading

def thread_function(df, path):
    # Runs in the background thread: write the DataFrame to disk.
    # Note: to_csv() with no path returns a string instead of writing a file,
    # so a file path must be passed in.
    df.to_csv(path)

def blue_function(df, path):
    # Start the export in a separate thread and hand the DataFrame
    # back to the caller without waiting for the write to finish.
    thread = threading.Thread(target=thread_function, args=(df, path))
    thread.start()
    return df

According to the documentation, calling thread.join() to wait for the thread to finish is not necessary, as:

the main thread is not a daemon thread and therefore all threads created in the main thread default to daemon = False. The entire Python program exits when no alive non-daemon threads are left.
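Putting the pieces together, here is a minimal, self-contained sketch of the pattern from the question. The transformation (doubling every column), the helper names, and the output file name are placeholders, not part of the original question:

```python
import threading
import pandas as pd

def _write_csv(df, path):
    # Runs in the background thread: the caller does not wait for this.
    df.to_csv(path, index=False)

def load_transform_export(df, path):
    # 2. Apply some transformation (placeholder: double every column).
    df = df * 2
    # 3a. Kick off the CSV export in a background thread...
    writer = threading.Thread(target=_write_csv, args=(df, path))
    writer.start()
    # 3b. ...and return the DataFrame immediately, while the file is written.
    return df, writer

df = pd.DataFrame({"a": [1, 2, 3]})
result, writer = load_transform_export(df, "blue_out.csv")
print(result["a"].tolist())  # the caller can use the data right away
writer.join()  # optional: only needed if you must know the file is on disk
```

Returning the `Thread` object alongside the DataFrame is one way to let the caller decide whether to wait for the export; dropping it works too, since a non-daemon thread keeps the program alive until the write finishes.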

EDIT:

Doing this, you spawn a new thread for the export, which lets the OS schedule the two procedures independently. The advantage is that one does not have to wait for the other to finish. This makes your code concurrent (asynchronous) rather than truly parallel.

In other programming languages, spawning threads also allows the OS to schedule them on different CPU cores so they run in parallel. In CPython this is not the case because of the GIL (Global Interpreter Lock), which allows only one thread to execute Python bytecode at a time.
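That said, CPython releases the GIL during blocking I/O, so an I/O-bound task like a file write can still overlap with other threads. A minimal sketch, using `time.sleep` as a stand-in for a blocking write:

```python
import threading
import time

def fake_io():
    # time.sleep blocks without holding the GIL, like a blocking file write.
    time.sleep(0.5)

start = time.time()
threads = [threading.Thread(target=fake_io) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.time() - start
# Four 0.5 s "writes" overlap, so the total is close to 0.5 s, not 2 s.
print(f"elapsed: {elapsed:.2f}s")
```

This is why the threading approach above is a reasonable fit for the question: the export is dominated by disk I/O, not by Python bytecode.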

You could start a new process instead of a thread, but for this task it would mostly add overhead (time and memory, since the DataFrame must be copied to the child process) without bringing any real advantage. Chrome spawns processes instead of threads for browser tabs, but it does so for security reasons, because processes do not share heap memory.

If you truly require CPU-bound parallelism within one process, your only real option is to implement that part in C, releasing the GIL, and call it from Python.

Šimon Kocúrek

If I understand your question correctly this code is for you:

import pandas as pd
from multiprocessing import Lock, Process
from time import time

def writefile(df, lock, filename):
    # Hold the lock while writing so concurrent processes
    # cannot interleave their appends to the same file.
    lock.acquire()
    df.to_csv(filename, index=False, mode='a', header=False)
    lock.release()


if __name__ == '__main__':
    N = 10000000

    df = pd.DataFrame({'a': range(1, N), 'b': range(1, N), 'c': range(1, N)})
    filename = "tmp.csv"

    start = time()
    df.to_csv(filename, index=False, mode='a', header=False)
    print("Standard execution time:", time() - start, 'seconds')

    start = time()
    lock = Lock()
    p = Process(target=writefile, args=(df, lock, filename))
    p.start()
    p.join()
    print("Multiprocessing execution time:", time() - start, 'seconds')

The multiprocessing version takes more time than the plain call, because the DataFrame must be copied to the child process. The Process and Lock combination is there to synchronize the writes when several processes append to the same file.

Massifox