
I have almost 7,000 CSV files, with almost 2.4 million rows in total. I've written code that opens each CSV and does some calculations to add new columns. In the end I would like to vstack all of these into one master CSV/txt file.

An example of my code (please excuse any mistakes, as this is example code):

    import numpy as np
    import pandas as pd

    def my_func(file):
        df = pd.read_csv(file)
        new_df = custom_calculations(df)
        return new_df

    newarray = np.empty((0, 85), int)   # 85 columns, no rows yet
    a = my_func(file)
    newarray = np.vstack([newarray, a])
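
Roughly, the full sequential loop over all the files looks like this (csv_list is my list of file paths, and master.txt is the output file):

    # sequential baseline: run my_func on every file and stack the results
    newarray = np.empty((0, 85), int)
    for file in csv_list:
        a = my_func(file)
        newarray = np.vstack([newarray, a])
    np.savetxt('master.txt', newarray, delimiter=',')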

I've been reading the documentation on threading, so that this will go faster. I followed some examples and came up with this code:

    import threading

    threads = []
    for ii in csv_list:
        process = threading.Thread(target=my_func, args=[ii])
        process.start()
        threads.append(my_func(ii))
        print('process: ', type(process), process)

    for process in threads:
        process.join()

It doesn't seem to be actually appending the arrays together though, and I'm not sure what I'm doing wrong.


1 Answer


First question: is it actually too slow to process the 7,000 files sequentially? Searching for how to implement a multithreaded solution, writing the code, debugging it, and then running it may take longer than just running it sequentially now and waiting an hour.

Second question: what is actually slow? Doing the calculations, or writing the results to the master file?
Because these are different problems: the computation is CPU-bound, so you should use multiple processes, while writing the CSV to disk is IO-bound, so threads are better suited to that.
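
For illustration only, a minimal sketch of the CPU-bound side with multiprocessing, reusing my_func and csv_list from your question:

    # sketch: spread the per-file calculations over several processes
    import multiprocessing as mp

    if __name__ == '__main__':
        with mp.Pool() as pool:                   # defaults to one worker per CPU core
            frames = pool.map(my_func, csv_list)  # each call runs in a worker process
        # frames is a list of DataFrames, one per input file, in the same order as csv_list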

It is simpler to parallelize the computations because they are distinct from each other, while writing to a single file cannot be parallelized as much.
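
And a sketch of the IO side: since everything ends up in one master file, the simplest approach is a single sequential write from the main process once the workers are done (master.csv is just a placeholder name):

    # sketch: one sequential write of all the results
    import pandas as pd

    master = pd.concat(frames, ignore_index=True)  # stack the per-file DataFrames
    master.to_csv('master.csv', index=False)       # one write, no locking needed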

There already exist answers for each side of the question:

(and many more)
