The following code reads several CSV files in one folder, filters each one according to the value in a column, and then appends the resulting dataframe to a single CSV file. Given that there are about 410 files of roughly 130 MB each, this currently takes about 30 minutes. I was wondering whether there is a quick way to make it faster by using a multiprocessing library. Could you offer me some tips on how to get started? Thank you.
import glob
import pandas as pd

path = r'C:\Users\\Documents\\'
allfiles = glob.glob(path + "*.csv")

with open('test.csv', 'w') as f:
    for i, file in enumerate(allfiles):
        df = pd.read_csv(file, index_col=None, header=0)
        df.sort_values(['A', 'B', 'C'], ascending=True, inplace=True)
        # forward-fill C within each (A, B) group to build D
        df['D'] = df.groupby(['A', 'B'])['C'].fillna(method='ffill')
        # keep only rows where D is 0 or 1 and append them to the output
        df[(df['D'] == 1) | (df['D'] == 0)].to_csv(f, header=False)
        print(i)
print("Done")