
I have a script that uses multiprocessing to open and perform calculations on ~200k .csv files. Here's the workflow:

1) Consider a folder with ~200k .csv files. Each .csv file contains the following:

.csv file example:

0, 1
2, 3
4, 5
...
~500 rows

2) The script stores the paths of all the .csv files in a list().

3) The script divides the list of ~200k .csv files into 8 lists, since I have 8 processors available.

4) The script calls do_something_with_csv() 8 times and performs the calculations in parallel.

In linear (single-process) mode, the execution takes around 4 min.

Both in parallel and in series, the first execution of the script takes much longer; the second, third, etc. runs take around 1 min. It seems like Python is caching the IO operations somehow? For example, since I have a progress bar, if I run until the progress bar reaches 5k/200k and terminate the program, the next execution goes through the first 5k files very quickly and then slows down.

Python version: 3.6.1

Pseudo Python code:

import csv
from multiprocessing import Manager, Process


def multiproc_dispatch():
    lst_of_all_csv_files = get_list_of_files('/path_to_csv_files')
    divided_lst_of_all_csv_files = split_list_chunks(lst_of_all_csv_files, 8)

    manager = Manager()
    shared_dict = manager.dict()

    jobs = []
    for csv_files_chunk in divided_lst_of_all_csv_files:
        p = Process(target=do_something_with_csv, args=(shared_dict, csv_files_chunk))
        jobs.append(p)
        p.start()

    # Wait for all workers to finish
    for job in jobs:
        job.join()


def read_csv_file(csv_file):
    # Read one two-column .csv file into two lists of floats
    lst_a = []
    lst_b = []
    with open(csv_file, 'r') as f_read:
        csv_reader = csv.reader(f_read, delimiter=',')
        for row in csv_reader:
            lst_a.append(float(row[0]))
            lst_b.append(float(row[1]))
    return lst_a, lst_b


def do_something_with_csv(shared_dict, csv_files_chunk):
    # Build a plain local dict first, then update the shared dict once per worker
    temp_dict = {}
    for csv_file in csv_files_chunk:
        lst_a, lst_b = read_csv_file(csv_file)
        temp_dict[csv_file] = (lst_a, lst_b)

    shared_dict.update(temp_dict)


if __name__ == '__main__':
    multiproc_dispatch()
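
For reference, get_list_of_files and split_list_chunks are simple helpers along these lines (a sketch, not the exact code I use):

import glob
import os

def get_list_of_files(folder):
    # Collect every .csv file directly inside `folder`
    return sorted(glob.glob(os.path.join(folder, '*.csv')))

def split_list_chunks(lst, n_chunks):
    # Split `lst` into `n_chunks` roughly equal consecutive slices
    chunk_size = (len(lst) + n_chunks - 1) // n_chunks
    return [lst[i:i + chunk_size] for i in range(0, len(lst), chunk_size)]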

Raphael

1 Answer


This is without a doubt the OS's RAM (page) caching coming into play: loading your files is faster the second time because the data is already in RAM and does not have to come from disk (I'm struggling to find good references here, any help welcome). This has nothing to do with multiprocessing, nor with Python itself.
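
A quick way to see the effect yourself (a minimal sketch, assuming the cache is cold on the first pass and using the placeholder path from the question): time two consecutive read passes over the same files; the second pass is served mostly from the OS page cache.

import glob
import time

def read_all(paths):
    # Read every file end-to-end so the OS pulls it into the page cache
    for path in paths:
        with open(path, 'rb') as f:
            f.read()

if __name__ == '__main__':
    files = glob.glob('/path_to_csv_files/*.csv')[:5000]  # a sample is enough

    start = time.time()
    read_all(files)
    print('first pass:  %.1f s' % (time.time() - start))

    start = time.time()
    read_all(files)
    print('second pass: %.1f s (mostly from the page cache)' % (time.time() - start))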

Irrelevant since the question edit: I think the extra time taken by your code when run in parallel comes from your shared_dict variable, which is accessed from within each subprocess (see e.g. here). Creating and sending data between processes in Python is slow and should be reduced to a minimum (here you could return one dict per job and then merge them, as sketched below).
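
For completeness, a minimal sketch of the "one dict per job, merged at the end" idea, reusing read_csv_file, get_list_of_files and split_list_chunks from the question: each worker builds and returns a plain dict, and the parent merges the results once, so no Manager proxy is touched inside the loops.

from multiprocessing import Pool

def process_chunk(csv_files_chunk):
    # Build a plain local dict; no shared state inside the worker
    local = {}
    for csv_file in csv_files_chunk:
        local[csv_file] = read_csv_file(csv_file)  # (lst_a, lst_b)
    return local

def multiproc_dispatch():
    lst_of_all_csv_files = get_list_of_files('/path_to_csv_files')
    chunks = split_list_chunks(lst_of_all_csv_files, 8)

    with Pool(processes=8) as pool:
        partial_dicts = pool.map(process_chunk, chunks)

    # Merge the per-worker results once, in the parent process
    merged = {}
    for d in partial_dicts:
        merged.update(d)
    return merged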

Théo Rubenach
  • My shared dict is already doing what you proposed in the original code; that is, the shared dict is updated only 8 times. I will update the question. I don't understand how the object can still be in RAM if I terminate the program. I think it may have something to do with garbage collection. – Raphael Feb 12 '20 at 15:00
  • 1
    Ok for your dict. Garbage collection means that your RAM is free to be used by other programs; however OS does not "reset" it; if your reload the same file shortly after, it will still be in cache. This is of course not visible from python but from an OS-level. This is why ram monitoring tools have a "cache" category. Read more about RAM caching ! – Théo Rubenach Feb 12 '20 at 15:04
  • I understand, but what I find very strange is that I'm monitoring RAM usage and it doesn't increase as it should if all 200k files were cached. I even tested with 5M .csv files; after the second run the execution is faster and the RAM usage stays the same. – Raphael Feb 12 '20 at 15:12
  • 1
    Cached RAM is considered free RAM; usually almost all free RAM is used as cached from previous computations (not only from python). So maybe your new cache just erased some old cache (provided you do look at Cached ram and not free ram) – Théo Rubenach Feb 12 '20 at 15:16
  • 1
    Now, that makes more sense. Thanks. – Raphael Feb 12 '20 at 15:17
  • Just another comment: when the code is running slowly (without cached RAM), the CPUs are at a few % usage, and when it's running faster, the CPU usage is 100%. Is this expected? I was logically expecting the opposite. – Raphael Feb 12 '20 at 15:19
  • Well, I guess that in the first case the bottleneck is file reading, not your CPUs; once everything is loaded, that bottleneck disappears? – Théo Rubenach Feb 12 '20 at 15:23
  • Yes, once everything is loaded, the execution is as expected. – Raphael Feb 12 '20 at 15:28
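
For reference, one way to watch the page cache grow while the files are being read, assuming Linux (it parses the Cached field of /proc/meminfo; other platforms expose the same information through their system monitors):

def cached_mem_kb():
    # Parse the 'Cached' line of /proc/meminfo (Linux only)
    with open('/proc/meminfo') as f:
        for line in f:
            if line.startswith('Cached:'):
                return int(line.split()[1])  # value is reported in kB
    return 0

if __name__ == '__main__':
    before = cached_mem_kb()
    # ... read a batch of .csv files here ...
    after = cached_mem_kb()
    print('page cache grew by ~%d MB' % ((after - before) // 1024))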