
I have hundreds of CSV files, each storing the same number of columns. Instead of reading them one at a time, I want to use multiprocessing.

For representation I have created 4 files: Book1.csv, Book2.csv, Book3.csv and Book4.csv; each stores the numbers 1 through 5 in column A, starting at row 1.

I am trying the following:

import pandas as pd
import multiprocessing
import numpy as np

def process(file):
    return pd.read_csv(file)

if __name__ == '__main__':
    loc = r'I:\Sims'
    fname = [loc + '\Book1.csv', loc + '\Book2.csv', loc + '\Book3.csv', loc + '\Book4.csv']
    p = multiprocessing.Pool()

    for f in fname:
        p.apply_async(process, [f])

    p.close()
    p.join()

I got the idea for the above code from the link.

But the above code is not producing the desired result, which I expected to be:

1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5

Edit: I want to load each file in a separate process and combine the file contents. Since I have hundreds of files to load and combine, I was hoping to make the process faster by loading 4 files at a time (my PC has 4 processors).

  • I don't see that your code is producing _any_ output, let alone the expected output. What are you trying to achieve? How do you want to process the data? – mhawke Nov 29 '16 at 23:09
  • If working with a big amount of tabular data is a frequent part of your workflow, you could have a look at dask: http://dask.pydata.org/en/latest/ – Arco Bast Nov 30 '16 at 00:06
  • Your code discards the dataframes after they are returned to the parent process. You could replace the `for` loop with `dataframes = pool.map(process, fname)` and get them in a list. Considering the operation is I/O bound and you add overhead passing the dataframe from child to parent, you may find this takes longer than just reading them in 1 process. – tdelaney Nov 30 '16 at 01:03
  • @tdelaney what do you mean by "reading them in 1 process"? – Zanam Nov 30 '16 at 02:03
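As tdelaney's comment suggests, the `for` loop over `apply_async` can be replaced with `pool.map`, which blocks until all files are read and hands the dataframes back to the parent process. A minimal sketch of just that change, keeping the rest of the question's code (the `with` block goes inside the `if __name__ == '__main__':` guard; `header=None` is an assumption, since the files as described hold raw numbers with no header row):

def process(file):
    # header=None: the first row is data, not column names
    return pd.read_csv(file, header=None)

with multiprocessing.Pool(processes=4) as p:  # one worker per core on a 4-core PC
    dataframes = p.map(process, fname)        # one DataFrame per file, in input order

And as Arco Bast's comment notes, dask can express the whole read in one call; a sketch, assuming dask is installed and the files match a single glob pattern:

import dask.dataframe as dd
df = dd.read_csv(loc + r'\Book*.csv', header=None).compute()  # one pandas DataFrame with all rows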

1 Answer


Try this:

import pandas as pd
import multiprocessing

def process(file):
    return pd.read_csv(file)

if __name__ == '__main__':
    loc = r'I:\Sims'
    fname = [loc + '\Book1.csv', loc + '\Book2.csv', loc + '\Book3.csv', loc + '\Book4.csv']

    with multiprocessing.Pool(5) as p:  # create a pool of 5 workers (note: Pool, not pool)
        result = p.map(process, fname)
    print(len(result))
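This prints 4, one DataFrame per file. To get the combined contents the question asks for, the returned frames can then be concatenated; a short sketch, assuming `process` passes `header=None` to `pd.read_csv` so the first value in each file is not consumed as a header:

combined = pd.concat(result, ignore_index=True)  # stack the four frames into one
print(combined[0].tolist())  # [1, 2, 3, 4, 5, 1, 2, 3, 4, 5, ...]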
Prabakar