
I am having a hard time understanding how to learn and leverage multiprocessing with my Python code. I am processing CSV files that are several gigabytes and tens of millions of records on a Windows OS, and I am running into a massive processing speed bottleneck. I have the following code:

import numpy as np
import pandas as pd
import datetime as dt

# Read the full CSV into memory.
df = pd.read_csv(r'C:...\2017_import.csv')

# Parse both date columns.
df['FinalActualDate'] = pd.to_datetime(df['FinalActualDate'])
df['StartDate'] = pd.to_datetime(df['StartDate'])

# Absolute difference between the two dates, expressed in days.
df['DaysToInHome'] = (df['FinalActualDate'] - df['StartDate']).abs() / np.timedelta64(1, 'D')

df.to_csv(r'C:...\2017_output4.csv', index=False)

The data is in a file that is 3.6 GB. It looks like:

Class,OwnerCode,Vendor,Campaign,Cycle,Channel,Product,Week,FinalActualDate,State,StartDate
3,ECM,VendorA,000206,06-17,A,ProductB,Initial,2017-06-14 02:01:00,NE,06-01-17 12:00:00
3,ECM,VendorB,000106,06-17,A,ProductA,Initial,2017-06-14 00:15:00,NY,06-01-17 12:00:00
3,ECM,AID,ED-17-0002-06,06-17,B,ProductB,Secondary,2017-06-13 20:30:00,MA,06-08-17 12:00:00
3,ECM,AID,ED-17-0002-06,06-17,C,ProductA,Third,2017-06-15 02:13:00,NE,06-15-17 12:00:00

This code works on small data sets, but it is taking several hours on the actual, large data set. I have tried several iterations of concurrent.futures and multiprocessing with no success. I am lost enough that it is not worth posting everything I have tried, but the general pattern I have been attempting is sketched below. I do realize that other factors impact speed, but obtaining new hardware is not an option. Any guidance would be appreciated.
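The pattern I keep seeing suggested (read the CSV in chunks and hand the chunks to a process pool) looks roughly like the sketch below, but I have not been able to get anything like it working well. The chunk size is a guess and the helper name is my own; this is only meant to show the shape of it, not the exact code I ran.

import concurrent.futures

import numpy as np
import pandas as pd


def process_chunk(chunk):
    # Convert the two date columns and compute the day difference for one chunk.
    chunk['FinalActualDate'] = pd.to_datetime(chunk['FinalActualDate'])
    chunk['StartDate'] = pd.to_datetime(chunk['StartDate'])
    chunk['DaysToInHome'] = (chunk['FinalActualDate'] - chunk['StartDate']).abs() / np.timedelta64(1, 'D')
    return chunk


if __name__ == '__main__':
    # Read the CSV in pieces instead of all at once; the chunk size is arbitrary.
    chunks = pd.read_csv(r'C:...\2017_import.csv', chunksize=1000000)
    with concurrent.futures.ProcessPoolExecutor() as executor:
        results = executor.map(process_chunk, chunks)
        pd.concat(results).to_csv(r'C:...\2017_output4.csv', index=False)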

  • Are you operating on one giant file or many giant files? And what kind of file size are we talking? – Brenden Petersen Oct 16 '17 at 18:17
  • What does your data look like? Particularly, what do the `FinalScanActualDate` and `MailDate` columns look like? Are they regular? If so, you could gain **a lot** of performance by simply using a `format` parameter to `pd.to_datetime`, or perhaps try memoizing it, if you expect lots of repetition. That function uses a date parser that infers your format, which is a costly operation. – juanpa.arrivillaga Oct 16 '17 at 18:18
  • @BrendenPetersen Sorry, it is one giant 3.6 GB file with over 30 million records. I updated the question. – John Minze Oct 16 '17 at 18:31
  • See [here](https://stackoverflow.com/questions/8717179/chunking-data-from-a-large-file-for-multiprocessing) for an example of using multiprocessing to read chunks of data from a CSV. I would also think about using something like HDF5 (and the Python package `h5py`) for managing data files that large. – Brenden Petersen Oct 16 '17 at 18:32
  • @juanpa.arrivillaga Sorry, I should have included that. I updated the question with that information. I am open to using a format parameter. I am a noob, so I haven't gotten to optimization yet. – John Minze Oct 16 '17 at 18:33
  • Aha. So it *does* seem regular. I would try providing the proper `format` argument to `pd.to_datetime`; this will *vastly* improve your performance. You could do this as well as memoizing it, just for good measure. You can use `functools.lru_cache` for this (see the sketch after this comment thread). If both of those don't give satisfactory results, *then* consider multiprocessing. – juanpa.arrivillaga Oct 16 '17 at 18:35
  • @juanpa.arrivillaga So telling Python what the date format is saves it from having to do all that 'extra' work of figuring it out? I added the `format` argument to `pd.to_datetime(df['fields'])` and it seemed to speed it up a lot. Thanks! – John Minze Oct 16 '17 at 19:05
  • @BrendenPetersen Thanks. I have seen many entries like that, but I am hitting a wall trying to get it into my code and working. – John Minze Oct 16 '17 at 19:07
  • @JohnMinze Yes. Check out my answer; I am getting a >8x speedup. But yes, `pd.to_datetime` is handy because it will understand things like `"thursday, april 2 2012"` without having to specify a format. It uses the `dateutil` parser under the hood. However, the cost of that flexibility is speed, which makes sense. – juanpa.arrivillaga Oct 16 '17 at 19:08
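For reference, here is a minimal sketch of the memoization idea mentioned above. The helper name `parse_date` is hypothetical, and the format string is taken from the `StartDate` sample in the question (the other column uses a different layout):

import functools

import pandas as pd


@functools.lru_cache(maxsize=None)
def parse_date(s):
    # Each distinct timestamp string is parsed once; repeated strings come out of the cache.
    return pd.to_datetime(s, format="%m-%d-%y %H:%M:%S")


# Usage would be something like: df['StartDate'] = df['StartDate'].map(parse_date)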

1 Answer


Before you go off into multiprocessing, I would consider dealing with some low-hanging fruit (which you'll want to do regardless):

Consider:

In [15]: df
Out[15]:
   Class OwnerCode   Vendor       Campaign  Cycle Channel   Product  \
0      3       ECM  VendorA         000206  06-17       A  ProductB
1      3       ECM  VendorB         000106  06-17       A  ProductA
2      3       ECM      AID  ED-17-0002-06  06-17       B  ProductB
3      3       ECM      AID  ED-17-0002-06  06-17       C  ProductA

        Week      FinalActualDate State          StartDate
0    Initial  2017-06-14 02:01:00    NE  06-01-17 12:00:00
1    Initial  2017-06-14 00:15:00    NY  06-01-17 12:00:00
2  Secondary  2017-06-13 20:30:00    MA  06-08-17 12:00:00
3      Third  2017-06-15 02:13:00    NE  06-15-17 12:00:00

Since your date-time formats are regular, just pass the `format` argument. Doing a simple test:

In [16]: dates = df.StartDate.repeat(10000)

In [17]: len(dates)
Out[17]: 40000

In [18]: %timeit pd.to_datetime(df.StartDate)
1000 loops, best of 3: 866 µs per loop

In [19]: %timeit pd.to_datetime(df.StartDate, format="%m-%d-%y %H:%M:%S")
10000 loops, best of 3: 106 µs per loop

I got an 8x increase in speed. Unless you are working with well over 8 cores, this is a much greater speed-up than you would expect from parallelizing it.
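Applied back to the script in the question, that would look roughly like the following. The two format strings are read off the sample rows above; note that `FinalActualDate` and `StartDate` use different layouts:

import numpy as np
import pandas as pd

df = pd.read_csv(r'C:...\2017_import.csv')

# FinalActualDate looks like "2017-06-14 02:01:00", StartDate like "06-01-17 12:00:00".
df['FinalActualDate'] = pd.to_datetime(df['FinalActualDate'], format="%Y-%m-%d %H:%M:%S")
df['StartDate'] = pd.to_datetime(df['StartDate'], format="%m-%d-%y %H:%M:%S")

df['DaysToInHome'] = (df['FinalActualDate'] - df['StartDate']).abs() / np.timedelta64(1, 'D')

df.to_csv(r'C:...\2017_output4.csv', index=False)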

juanpa.arrivillaga