
New to pandas, I already want to parallelize a row-wise apply operation. So far I found Parallelize apply after pandas groupby. However, that only seems to work for grouped data frames.

My use case is different: I have a list of holidays, and for the current row/date I want to find the number of days before and after it to the nearest holiday.

This is the function I call via apply:

import numpy as np

def get_nearest_holiday(x, pivot):
    # x is an iterable of holiday dates; pivot is the date in question
    nearestHoliday = min(x, key=lambda d: abs(d - pivot))
    difference = abs(nearestHoliday - pivot)
    return difference / np.timedelta64(1, 'D')  # difference in days as a float
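
For context, this is roughly how it is invoked row-wise (datesFrame and holidays are the names used further down: a frame with a myDates column and an iterable of holiday dates, respectively):

datesFrame['daysToHoliday'] = datesFrame.myDates.apply(
    lambda d: get_nearest_holiday(holidays, d))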

How can I speed it up?

Edit

I experimented a bit with Python's multiprocessing pools, but it was neither nice code, nor did I get my computed results back.
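
For reference, a minimal sketch of the Pool pattern that does hand the results back, assuming the datesFrame, holidays, and get_nearest_holiday names from above; the key point is to collect what pool.map returns:

import multiprocessing as mp
import numpy as np
import pandas as pd

def process_chunk(chunk):
    # apply the row-wise helper to one chunk of the frame
    return chunk.myDates.apply(lambda d: get_nearest_holiday(holidays, d))

if __name__ == '__main__':
    chunks = np.array_split(datesFrame, mp.cpu_count())
    with mp.Pool(processes=mp.cpu_count()) as pool:
        parts = pool.map(process_chunk, chunks)  # pool.map returns the computed pieces
    result = pd.concat(parts)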

Georg Heiler
  • "python pools" - threads or processes? – Ami Tavory Sep 02 '16 at 05:57
  • I was using multiprocessing.Pool(processes= #ofCPU) – Georg Heiler Sep 02 '16 at 06:00
  • So multiprocessing is not guaranteed to speed up your code, but, since the code wasn't working correctly, it's hard to know what at all it was running there. You might want to make your question about that (FWIW, this approach looks like your best bet to me). – Ami Tavory Sep 02 '16 at 06:04
  • Would cythonizing not be a good first step before you resort to parallelizing apply? – SerialDev Sep 02 '16 at 07:21
  • As far as I understand the problem it is embarrassingly parallel e.g. each row is independent, so parallel execution should be better suited. – Georg Heiler Sep 02 '16 at 08:43
  • @geoHeil I'll be happy to look at it a bit later. – Ami Tavory Sep 02 '16 at 09:58
  • Have you considered using dask and a dask.dataframe instead? It would give you an easy way to parallelize your calculation "for free" (a minimal sketch follows after these comments) – Zeugma Sep 02 '16 at 13:57
  • I will have to look into that. Is it generally as good as pandas but only parallelized? – Georg Heiler Sep 02 '16 at 14:17
  • @AmiTavory I added a minimal code example on github: https://github.com/geoHeil/pythonQuestions/blob/master/improvementsDateOperations.ipynb – Georg Heiler Sep 03 '16 at 15:25
  • @geoHeil Sorry about that. I'll really try to have a look a bit later on. – Ami Tavory Sep 03 '16 at 15:27
  • @geoHeil So just a question about the setting - are the holidays all (or most) at fixed dates each year, or do they vary? – Ami Tavory Sep 04 '16 at 06:13
  • They are from http://www.timeanddate.com/holidays/germany/ I filter for national holidays. These are usually fixed e.g. Christmas on the 24th of December – Georg Heiler Sep 04 '16 at 06:16
  • @AmiTavory I just updated the minimum example: https://github.com/geoHeil/pythonQuestions/blob/master/improvementsDateOperations.ipynb even for my approach 3 only the existing column is returned. I do not see the computed result. – Georg Heiler Sep 04 '16 at 06:46
  • @geoHeil I've attempted an answer with an approach that doesn't rely on parallel stuff... Wasn't 100% sure how adamant you were you wanted that or whether you were just trying anything to speed things up... – Jon Clements Sep 04 '16 at 09:02
  • @NinjaPuppy will need to try your solution first but I am looking for a quicker solution. As the problem should parallelize fine I thought this would be the way to go. Unfortunately I am new to python and struggle to set up parallel processing correctly. As you can see the computation is performed in parallel but the results are not returned. If you have an idea what is wrong I would be glad. – Georg Heiler Sep 04 '16 at 09:07
  • @geoHeil I generally find that by the time I've got parallel stuff working properly I could have just ran the thing a few hundred times and have been done already :p Anyway... hopefully the suggestion should be fast enough... don't fancy looking into parallel stuff for a Sunday morning :) – Jon Clements Sep 04 '16 at 09:08
  • That sounds great – Georg Heiler Sep 04 '16 at 09:09
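
Following up on the dask suggestion above, a minimal sketch of what that could look like (untested; datesFrame, holidays, and get_nearest_holiday are the names from the question, and the partition count is arbitrary):

import dask.dataframe as dd

ddf = dd.from_pandas(datesFrame, npartitions=8)
result = ddf.myDates.apply(
    lambda d: get_nearest_holiday(holidays, d),
    meta=('myDates', 'float64'),  # dask needs the output dtype up front
).compute(scheduler='processes')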

4 Answers


For the parallel approach, this is the answer based on Parallelize apply after pandas groupby:

from joblib import Parallel, delayed
import multiprocessing
import pandas as pd

def get_nearest_dateParallel(df):
    # get_nearest_holiday is the helper from the question; holidays.day holds the holiday dates
    df['daysBeforeHoliday'] = df.myDates.apply(
        lambda x: get_nearest_holiday(holidays.day[holidays.day < x], x))
    df['daysAfterHoliday'] = df.myDates.apply(
        lambda x: get_nearest_holiday(holidays.day[holidays.day > x], x))
    return df

def applyParallel(dfGrouped, func):
    # run func on each group in its own process, then stitch the pieces back together
    retLst = Parallel(n_jobs=multiprocessing.cpu_count())(
        delayed(func)(group) for name, group in dfGrouped)
    return pd.concat(retLst)

print('parallel version:')
# 4 min 30 seconds
%time result = applyParallel(datesFrame.groupby(datesFrame.index), get_nearest_dateParallel)

However, I prefer @NinjaPuppy's approach because it does not require O(n * number_of_holidays) work.

Georg Heiler

I think the pandarallel package makes it way easier to do this now. I have not looked into it much, but it should do the trick.
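
For example, something along these lines (untested sketch; parallel_apply is pandarallel's drop-in replacement for apply and mirrors its signature):

# pip install pandarallel
import pandas as pd
from pandarallel import pandarallel

pandarallel.initialize(progress_bar=False)

df = pd.DataFrame({'a': range(1000)})
# runs the row-wise function across all cores
result = df.parallel_apply(lambda row: row['a'] ** 2, axis=1)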

Charitarth Chugh

I think going down the route of trying stuff in parallel is probably overcomplicating this. I haven't tried this approach on a large sample, so your mileage may vary, but it should give you an idea...

Let's just start with some dates...

import pandas as pd

dates = pd.to_datetime(['2016-01-03', '2016-09-09', '2016-12-12', '2016-03-03'])

We'll use some holiday data from pandas.tseries.holiday - note that in effect we want a DatetimeIndex...

from pandas.tseries.holiday import USFederalHolidayCalendar

holiday_calendar = USFederalHolidayCalendar()
holidays = holiday_calendar.holidays('2016-01-01')

This gives us:

DatetimeIndex(['2016-01-01', '2016-01-18', '2016-02-15', '2016-05-30',
               '2016-07-04', '2016-09-05', '2016-10-10', '2016-11-11',
               '2016-11-24', '2016-12-26',
               ...
               '2030-01-01', '2030-01-21', '2030-02-18', '2030-05-27',
               '2030-07-04', '2030-09-02', '2030-10-14', '2030-11-11',
               '2030-11-28', '2030-12-25'],
              dtype='datetime64[ns]', length=150, freq=None)

Now we find the indices of the next holiday on or after each of the original dates using searchsorted:

indices = holidays.searchsorted(dates)
# array([1, 6, 9, 3])
next_nearest = holidays[indices]
# DatetimeIndex(['2016-01-18', '2016-10-10', '2016-12-26', '2016-05-30'], dtype='datetime64[ns]', freq=None)

Then take the difference between the two:

next_nearest_diff = pd.to_timedelta(next_nearest.values - dates.values).days
# array([15, 31, 14, 88])

You'll need to be careful about the indices so you don't wrap around at either end, and for the previous holiday, do the same calculation with indices - 1, but it should act as (I hope) a relatively good base.
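
A sketch of that previous-holiday half, guarding the wrap-around at index 0 (untested, building on the variables above):

prev_indices = (indices - 1).clip(min=0)  # index 0 has no earlier holiday
prev_nearest = holidays[prev_indices]
prev_diff = pd.to_timedelta(dates.values - prev_nearest.values).days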

Jon Clements
  • I updated the minimal example with your code (please see at the bottom). Trying to use my own datetime indices for the holidays, I receive an index out of bounds error. – Georg Heiler Sep 04 '16 at 09:34
  • Comments are not for extended discussion; this conversation has been [moved to chat](http://chat.stackoverflow.com/rooms/122604/discussion-on-answer-by-ninja-puppy-parallelize-pandas-apply). – Jon Clements Sep 04 '16 at 09:41

You can also easily parallelize your calculations using the parallel-pandas library. Only two additional lines of code!

# pip install parallel-pandas
import pandas as pd
import numpy as np
from parallel_pandas import ParallelPandas

# initialize parallel-pandas
ParallelPandas.initialize(n_cpu=8, disable_pr_bar=True)

def foo(x):
    """Your awesome function"""
    return np.sqrt(np.sum(x ** 2))    

df = pd.DataFrame(np.random.random((1000, 1000)))

%%time
res = df.apply(foo, raw=True)

Wall time: 5.3 s

# p_apply - is parallel analogue of apply method
%%time
res = df.p_apply(foo, raw=True, executor='processes')

Wall time: 1.2 s
padu