
I am having a hard time understanding how to learn and leverage multiprocessing with my Python code. I am processing CSV files that are several gigabytes and tens of millions of records on a Windows OS, and I am running into a massive processing speed bottleneck. I have the following code:

import numpy as np
import pandas as pd
import datetime as dt

# Read the full CSV into memory.
df = pd.read_csv(r'C:...\2017_import.csv')

# Parse both date columns.
df['FinalActualDate'] = pd.to_datetime(df['FinalActualDate'])
df['StartDate'] = pd.to_datetime(df['StartDate'])

# Absolute difference between the two dates, expressed in days.
df['DaysToInHome'] = (df['FinalActualDate'] - df['StartDate']).abs() / np.timedelta64(1, 'D')

df.to_csv(r'C:...\2017_output4.csv', index=False)

The data is in a file that is 3.6 GB. It looks like:

Class,OwnerCode,Vendor,Campaign,Cycle,Channel,Product,Week,FinalActualDate,State,StartDate
3,ECM,VendorA,000206,06-17,A,ProductB,Initial,2017-06-14 02:01:00,NE,06-01-17 12:00:00
3,ECM,VendorB,000106,06-17,A,ProductA,Initial,2017-06-14 00:15:00,NY,06-01-17 12:00:00
3,ECM,AID,ED-17-0002-06,06-17,B,ProductB,Secondary,2017-06-13 20:30:00,MA,06-08-17 12:00:00
3,ECM,AID,ED-17-0002-06,06-17,C,ProductA,Third,2017-06-15 02:13:00,NE,06-15-17 12:00:00

This code works on small data sets, but it is taking several hours on the actual, large data set. I have tried several iterations of concurrent.futures and multiprocessing with no success. I am lost enough that it is not worth posting everything I have tried, but the general pattern I have been attempting is sketched below. I do realize that other factors impact speed, but obtaining new hardware is not an option. Any guidance would be appreciated.
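The pattern I keep seeing suggested (read the CSV in chunks and hand the chunks to a process pool) looks roughly like the sketch below, but I have not been able to get anything like it working well. The chunk size is a guess and the helper name is my own; this is only meant to show the shape of it, not the exact code I ran.

import concurrent.futures

import numpy as np
import pandas as pd


def process_chunk(chunk):
    # Convert the two date columns and compute the day difference for one chunk.
    chunk['FinalActualDate'] = pd.to_datetime(chunk['FinalActualDate'])
    chunk['StartDate'] = pd.to_datetime(chunk['StartDate'])
    chunk['DaysToInHome'] = (chunk['FinalActualDate'] - chunk['StartDate']).abs() / np.timedelta64(1, 'D')
    return chunk


if __name__ == '__main__':
    # Read the CSV in pieces instead of all at once; the chunk size is arbitrary.
    chunks = pd.read_csv(r'C:...\2017_import.csv', chunksize=1000000)
    with concurrent.futures.ProcessPoolExecutor() as executor:
        results = executor.map(process_chunk, chunks)
        pd.concat(results).to_csv(r'C:...\2017_output4.csv', index=False)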

  • Are you operating on one giant file or many giant files? And what kind of file size are we talking? – Brenden Petersen Oct 16 '17 at 18:17
  • What does your data look like? Particularly, what do the `FinalScanActualDate` and `MailDate` columns look like? Are they regular? If so, you could gain **a lot** of performance by simply using a `format` parameter to `pd.to_datetime`, or perhaps try memoizing it, if you expect lots of repetition. That function uses a date parser that infers your format, which is a costly operation. – juanpa.arrivillaga Oct 16 '17 at 18:18
  • @BrendenPetersen Sorry, it is one giant 3.6 GB file with over 30 million records. I updated the question. – John Minze Oct 16 '17 at 18:31
  • See [here](https://stackoverflow.com/questions/8717179/chunking-data-from-a-large-file-for-multiprocessing) for an example of using multiprocessing to read chunks of data from a CSV. I would also think about using something like HDF5 (and the Python package `h5py`) for managing data files that large. – Brenden Petersen Oct 16 '17 at 18:32
  • @juanpa.arrivillaga Sorry, I should have included that. I updated the question with that information. I am open to using a format parameter. I am a noob, so I haven't gotten to optimization yet. – John Minze Oct 16 '17 at 18:33
  • Aha. So it *does* seem regular. I would try providing the proper `format` argument to `pd.to_datetime`; this will *vastly* improve your performance. You could do this as well as memoizing it, just for good measure. You can use `functools.lru_cache` for this (see the sketch after this comment thread). If both of those don't give satisfactory results, *then* consider multiprocessing. – juanpa.arrivillaga Oct 16 '17 at 18:35
  • @juanpa.arrivillaga So telling Python what the date format is saves it from having to do all that 'extra' work of figuring it out? I added the `format` argument to `pd.to_datetime(df['fields'])` and it seemed to speed it up a lot. Thanks! – John Minze Oct 16 '17 at 19:05
  • @BrendenPetersen Thanks. I have seen many entries like that, but I am hitting a wall trying to get it into my code and working. – John Minze Oct 16 '17 at 19:07
  • @JohnMinze Yes. Check out my answer; I am getting a >8x speedup. But yes, `pd.to_datetime` is handy because it will understand things like `"thursday, april 2 2012"` without having to specify a format. It uses the `dateutil` parser under the hood. However, the cost of that flexibility is speed, which makes sense. – juanpa.arrivillaga Oct 16 '17 at 19:08
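For reference, here is a minimal sketch of the memoization idea mentioned above. The helper name `parse_date` is hypothetical, and the format string is taken from the `StartDate` sample in the question (the other column uses a different layout):

import functools

import pandas as pd


@functools.lru_cache(maxsize=None)
def parse_date(s):
    # Each distinct timestamp string is parsed once; repeated strings come out of the cache.
    return pd.to_datetime(s, format="%m-%d-%y %H:%M:%S")


# Usage would be something like: df['StartDate'] = df['StartDate'].map(parse_date)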

1 Answer


Before you go off into multiprocessing, I would consider dealing with some low-hanging fruit (which you'll want to do regardless):

Consider:

In [15]: df
Out[15]:
   Class OwnerCode   Vendor       Campaign  Cycle Channel   Product  \
0      3       ECM  VendorA         000206  06-17       A  ProductB
1      3       ECM  VendorB         000106  06-17       A  ProductA
2      3       ECM      AID  ED-17-0002-06  06-17       B  ProductB
3      3       ECM      AID  ED-17-0002-06  06-17       C  ProductA

        Week      FinalActualDate State          StartDate
0    Initial  2017-06-14 02:01:00    NE  06-01-17 12:00:00
1    Initial  2017-06-14 00:15:00    NY  06-01-17 12:00:00
2  Secondary  2017-06-13 20:30:00    MA  06-08-17 12:00:00
3      Third  2017-06-15 02:13:00    NE  06-15-17 12:00:00

Since your date-time formats are regular, just pass the `format` argument. Doing a simple test:

In [16]: dates = df.StartDate.repeat(10000)

In [17]: len(dates)
Out[17]: 40000

In [18]: %timeit pd.to_datetime(df.StartDate)
1000 loops, best of 3: 866 µs per loop

In [19]: %timeit pd.to_datetime(df.StartDate, format="%m-%d-%y %H:%M:%S")
10000 loops, best of 3: 106 µs per loop

I got an 8x increase in speed. Unless you are working with well over 8 cores, this is a much greater speed-up than you would expect from parallelizing it.
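Applied back to the script in the question, that would look roughly like the following. The two format strings are read off the sample rows above; note that `FinalActualDate` and `StartDate` use different layouts:

import numpy as np
import pandas as pd

df = pd.read_csv(r'C:...\2017_import.csv')

# FinalActualDate looks like "2017-06-14 02:01:00", StartDate like "06-01-17 12:00:00".
df['FinalActualDate'] = pd.to_datetime(df['FinalActualDate'], format="%Y-%m-%d %H:%M:%S")
df['StartDate'] = pd.to_datetime(df['StartDate'], format="%m-%d-%y %H:%M:%S")

df['DaysToInHome'] = (df['FinalActualDate'] - df['StartDate']).abs() / np.timedelta64(1, 'D')

df.to_csv(r'C:...\2017_output4.csv', index=False)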

juanpa.arrivillaga