
The following code reads several CSV files in one folder, filters each one according to the value in a column, and then appends the resulting dataframe to a CSV file. Given that there are about 410 files of roughly 130 MB each, this currently takes about 30 min. I was wondering if there is a quick way to make it faster by using a multiprocessing library. Could you offer me some tips on how to get started? Thank you.

import pandas as pd
import glob

path = r'C:\Users\\Documents\\'
allfiles = glob.glob(path + "*.csv")

with open('test.csv', 'w') as f:
    for i, file in enumerate(allfiles):
        # read one file, sort it, and store the group-wise forward-fill of C in a new column D
        df = pd.read_csv(file, index_col=None, header=0)
        df.sort_values(['A', 'B', 'C'], ascending=True, inplace=True)
        df['D'] = df.groupby(['A', 'B'])['C'].fillna(method='ffill')
        # keep only the rows where D is 0 or 1 and append them to test.csv
        df[(df['D'] == 1) | (df['D'] == 0)].to_csv(f, header=False)

        print(i)
    print("Done")
– dleal
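
For reference, a minimal sketch of the pattern being asked about, assuming a multiprocessing.Pool in which each worker reads and filters one file and only the parent process writes to test.csv, so the output file needs no locking. The helper name process_one and the pool size of 4 are illustrative choices, not from the question, and as the comments below point out, this only helps if the CPU rather than disk I/O is the bottleneck.

import pandas as pd
import glob
import multiprocessing as mp

path = r'C:\Users\\Documents\\'          # same placeholder path as in the question

def process_one(filename):
    """Read one CSV and apply the same sort/forward-fill/filter as the original loop."""
    df = pd.read_csv(filename, index_col=None, header=0)
    df.sort_values(['A', 'B', 'C'], ascending=True, inplace=True)
    df['D'] = df.groupby(['A', 'B'])['C'].ffill()     # same forward-fill as in the question
    return df[(df['D'] == 1) | (df['D'] == 0)]

if __name__ == '__main__':                # guard is required on Windows for multiprocessing
    allfiles = glob.glob(path + "*.csv")
    pool = mp.Pool(processes=4)           # assumption: tune to core count and disk speed
    with open('test.csv', 'w') as f:
        # imap yields the filtered frames as workers finish; only this process writes
        for i, filtered in enumerate(pool.imap(process_one, allfiles)):
            filtered.to_csv(f, header=False)
            print(i)
    pool.close()
    pool.join()
    print("Done")
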
  • Did you try profiling? What is the slowest part? Multiprocessing is NOT a remedy for everything - sometimes it makes things even slower... What is the speed of your I/O system? – MaxU - stand with Ukraine Aug 17 '16 at 21:56
  • I agree with @MaxU, if you're just reading/writing on a single machine I suspect that disk I/O might be the bottleneck (which isn't fixed through parallelization) – Kris Aug 17 '16 at 22:06
  • Hi, thank you for your replies. Indeed the slowest part is reading the files. I was hoping I would be able to read one file in each core, process them, and then append them to the same csv file. – dleal Aug 17 '16 at 22:09
  • If I were you I would consider moving the data to some relational DB (MySQL, PostgreSQL) or using HDF5 files instead of CSV files - they are usually much faster (a sketch of the HDF5 route follows these comments). [Here is the result of my comparison](http://stackoverflow.com/a/37012035/5741205) – MaxU - stand with Ukraine Aug 17 '16 at 22:16
  • Thank you both. I've been exploring Dask and HDF5, but I haven't been able to figure out how to use them with Pandas for datasets that I can't fit in memory (see the Dask sketch after these comments). – dleal Aug 17 '16 at 22:28
  • @dleal, [here](http://stackoverflow.com/a/38472574/5741205) is an example of "How to import a gzip file larger than RAM limit into a Pandas DataFrame and write it to HDF5" – MaxU - stand with Ukraine Aug 17 '16 at 23:08
  • @dleal, your task is not CPU heavy, and you intend to write to a single output file, which means you'll have to use an output buffer. The former means that cores will just be waiting on the file to be read before performing a computation that takes a few milliseconds, then waiting on the output buffer to dump the results. The latter will require setting locks or managing queues, which will just increase the overhead. So multiprocessing will probably slow things down even more... Best use HDF5 for your task. – Kartik Aug 17 '16 at 23:49
  • Not to mention that you are on a Windows machine, and Windows has its own weird way of handling processes spawning other processes and of communication between them. You will end up with huge memory bloat that will crash your machine, or with code too convoluted to run fast and efficiently. Writing a proper MP program that handles all these things would require days of development. Again, you're better off with HDF5, which you can piece together in less than a couple of hours of development time. – Kartik Aug 17 '16 at 23:52
  • Thank you Kartik, I appreciate your reply. I am looking into HDF5 now. Also, thanks to MaxU for the link. – dleal Aug 17 '16 at 23:54
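
Following the HDF5 suggestion in the comments above, a minimal sketch of that route, keeping the same per-file filtering but appending everything to a single compressed HDF5 table instead of a CSV. The file name filtered.h5, the table key 'data' and the compression settings are assumptions, and the tables (PyTables) package must be installed.

import pandas as pd
import glob

path = r'C:\Users\\Documents\\'          # same placeholder path as in the question
allfiles = glob.glob(path + "*.csv")

store = pd.HDFStore('filtered.h5', mode='w', complevel=9, complib='blosc')
for i, filename in enumerate(allfiles):
    df = pd.read_csv(filename, index_col=None, header=0)
    df.sort_values(['A', 'B', 'C'], ascending=True, inplace=True)
    df['D'] = df.groupby(['A', 'B'])['C'].ffill()
    # append the filtered rows to one on-disk table; data_columns makes them queryable later
    store.append('data', df[(df['D'] == 1) | (df['D'] == 0)],
                 data_columns=['A', 'B', 'D'], index=False)
    print(i)
store.close()

The resulting table can then be read back selectively without loading everything into memory, e.g. pd.read_hdf('filtered.h5', 'data', where='D == 1').
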
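Since Dask also came up in the comments, here is a rough sketch of the same job with dask.dataframe, assuming columns A, B and C exist as in the question and that blocksize=None yields one partition per input file, so the per-file logic can be applied with map_partitions. The output pattern 'filtered-*.csv' (one output file per partition) is an assumption; Dask schedules the partitions itself, so the combined data never has to fit in RAM.

import dask.dataframe as dd

path = r'C:\Users\\Documents\\'          # same placeholder path as in the question

def filter_one(pdf):
    # same per-file logic as the original loop, applied to one partition (= one file)
    pdf = pdf.sort_values(['A', 'B', 'C'])
    pdf['D'] = pdf.groupby(['A', 'B'])['C'].ffill()
    return pdf[(pdf['D'] == 1) | (pdf['D'] == 0)]

ddf = dd.read_csv(path + "*.csv", blocksize=None)    # blocksize=None -> one partition per file
ddf.map_partitions(filter_one).to_csv('filtered-*.csv', index=False, header=False)
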

0 Answers