
I have hundreds of CSV files, each storing the same number of columns. Instead of reading them one at a time, I want to use multiprocessing.

For representation I have created 4 files: Book1.csv, Book2.csv, Book3.csv and Book4.csv; each stores the numbers 1 through 5 in column A, starting at row 1.

I am trying the following:

import pandas as pd
import multiprocessing
import numpy as np

def process(file):
    return pd.read_csv(file)

if __name__ == '__main__':
    loc = r'I:\Sims'
    fname = [loc + '\Book1.csv', loc + '\Book2.csv', loc + '\Book3.csv', loc + '\Book4.csv']
    p = multiprocessing.Pool()

    for f in fname:
        p.apply_async(process, [f])

    p.close()
    p.join()

I got the idea for the above code from the link.

But the above code is not producing the desired result, which I expected to be:

1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5

Edit: I want to load each file in a separate process and combine the file contents. Since I have hundreds of files to load and combine, I was hoping to make the process faster by loading 4 files at a time (my PC has 4 processors).

  • I don't see that your code is producing _any_ output, let alone the expected output. What are you trying to achieve? How do you want to process the data? – mhawke Nov 29 '16 at 23:09
  • If working with a big amount of tabular data is a frequent part of your workflow, you could have a look at dask: http://dask.pydata.org/en/latest/ – Arco Bast Nov 30 '16 at 00:06
  • Your code discards the dataframes after they are returned to the parent process. You could replace the `for` loop with `dataframes = pool.map(process, fname)` and get them in a list. Considering the operation is I/O bound and you add overhead passing the dataframe from child to parent, you may find this takes longer than just reading them in 1 process. – tdelaney Nov 30 '16 at 01:03
  • @tdelaney what do you mean by "reading them in 1 process"? – Zanam Nov 30 '16 at 02:03
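As tdelaney's comment suggests, the `for` loop over `apply_async` can be replaced with `pool.map`, which blocks until all files are read and hands the dataframes back to the parent process. A minimal sketch of just that change, keeping the rest of the question's code (the `with` block goes inside the `if __name__ == '__main__':` guard; `header=None` is an assumption, since the files as described hold raw numbers with no header row):

def process(file):
    # header=None: the first row is data, not column names
    return pd.read_csv(file, header=None)

with multiprocessing.Pool(processes=4) as p:  # one worker per core on a 4-core PC
    dataframes = p.map(process, fname)        # one DataFrame per file, in input order

And as Arco Bast's comment notes, dask can express the whole read in one call; a sketch, assuming dask is installed and the files match a single glob pattern:

import dask.dataframe as dd
df = dd.read_csv(loc + r'\Book*.csv', header=None).compute()  # one pandas DataFrame with all rows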

1 Answer


Try this:

import pandas as pd
import multiprocessing

def process(file):
    return pd.read_csv(file)

if __name__ == '__main__':
    loc = r'I:\Sims'
    fname = [loc + '\Book1.csv', loc + '\Book2.csv', loc + '\Book3.csv', loc + '\Book4.csv']

    with multiprocessing.Pool(5) as p:  # create a pool of 5 workers (note: Pool, not pool)
        result = p.map(process, fname)
    print(len(result))
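This prints 4, one DataFrame per file. To get the combined contents the question asks for, the returned frames can then be concatenated; a short sketch, assuming `process` passes `header=None` to `pd.read_csv` so the first value in each file is not consumed as a header:

combined = pd.concat(result, ignore_index=True)  # stack the four frames into one
print(combined[0].tolist())  # [1, 2, 3, 4, 5, 1, 2, 3, 4, 5, ...]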
Prabakar