I am having a hard time learning how to use multiprocessing with my Python code. I am processing CSV files that are several gigabytes and tens of millions of records each, on Windows, and I have hit a serious processing bottleneck. I have the following code:
import numpy as np
import pandas as pd
import datetime as dt

# read the whole file into memory at once
df = pd.read_csv(r'C:...\2017_import.csv')

# parse both date columns, then compute the absolute difference in days
df['FinalActualDate'] = pd.to_datetime(df['FinalActualDate'])
df['StartDate'] = pd.to_datetime(df['StartDate'])
df['DaysToInHome'] = (df['FinalActualDate'] - df['StartDate']).abs() / np.timedelta64(1, 'D')

# write the result back out
df.to_csv(r'C:...\2017_output4.csv', index=False)
The file is about 3.6 GB. The data looks like:
Class,OwnerCode,Vendor,Campaign,Cycle,Channel,Product,Week,FinalActualDate,State,StartDate
3,ECM,VendorA,000206,06-17,A,ProductB,Initial,2017-06-14 02:01:00,NE,06-01-17 12:00:00
3,ECM,VendorB,000106,06-17,A,ProductA,Initial,2017-06-14 00:15:00,NY,06-01-17 12:00:00
3,ECM,AID,ED-17-0002-06,06-17,B,ProductB,Secondary,2017-06-13 20:30:00,MA,06-08-17 12:00:00
3,ECM,AID,ED-17-0002-06,06-17,C,ProductA,Third,2017-06-15 02:13:00,NE,06-15-17 12:00:00
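One detail I am not sure matters: the two date columns come in different formats (FinalActualDate looks like 2017-06-14 02:01:00, StartDate looks like 06-01-17 12:00:00). If I spelled those formats out explicitly for pandas, continuing from the df loaded above, I believe it would look something like this (the format strings are just my reading of the sample rows):

df['FinalActualDate'] = pd.to_datetime(df['FinalActualDate'], format='%Y-%m-%d %H:%M:%S')
df['StartDate'] = pd.to_datetime(df['StartDate'], format='%m-%d-%y %H:%M:%S')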
The code above works on small data sets, but on the actual, large data set it takes several hours. I have tried several variations using concurrent.futures and multiprocessing with no success; my exact attempts are not worth posting, but a stripped-down sketch of the direction I have been going is below. I realize other factors affect speed, but new hardware is not an option. Any guidance would be appreciated.
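Roughly the shape of what I have been trying (simplified; chunksize=1000000 and max_workers=4 are placeholder values, not my real ones):

import numpy as np
import pandas as pd
from concurrent.futures import ProcessPoolExecutor

def process_chunk(chunk):
    # same per-chunk work as the single-process version above
    chunk['FinalActualDate'] = pd.to_datetime(chunk['FinalActualDate'])
    chunk['StartDate'] = pd.to_datetime(chunk['StartDate'])
    chunk['DaysToInHome'] = (chunk['FinalActualDate'] - chunk['StartDate']).abs() / np.timedelta64(1, 'D')
    return chunk

if __name__ == '__main__':
    # read the file in pieces instead of all at once
    chunks = pd.read_csv(r'C:...\2017_import.csv', chunksize=1000000)
    with ProcessPoolExecutor(max_workers=4) as executor:
        results = executor.map(process_chunk, chunks)
    pd.concat(results).to_csv(r'C:...\2017_output4.csv', index=False)

My understanding is that the if __name__ == '__main__' guard is required on Windows for multiprocessing, so I have kept it in, but I may well be misusing the rest.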