1

I have the following function where df is a pandas dataframe that is 159538 rows x 3 columns:

dfs = []
for i in df['email_address']:
    data = df[df['email_address'] == i]
    data['difference'] = data['ts_placed'].diff().astype('timedelta64[D]')
    repeat = []
    for a in data['difference']:
        if a > 10:
            repeat.append(0)
        elif a <= 10:
            repeat.append(1)
        else:
            repeat.append(0)
    data['repeat'] = repeat
    dfs.append(data)

the function runs extremely slow. I would like to speed up the process by using multiprocessing. This SO question shows how to do this in R. What is the equivalent code for python?

this is a sample of the data after running:

df['difference'] = df.groupby('email_address')['ts_placed'].diff()



df
Out[6]: 
                           email_address           ts_placed       difference
0                  aaaaaaaaaaaaa@sky.com 2015-08-06 00:00:34              NaT
1              dfdfdfdfdfd@babcock.co.uk 2015-08-06 00:05:38              NaT
2        littlemifddreen85@hotmail.co.uk 2015-08-06 00:09:20              NaT
3                    smifdfddfms@aol.com 2015-08-06 00:10:01              NaT
4       terry.wfdfdfdfdfy-holdings.co.uk 2015-08-06 00:14:00              NaT
5              r.dfdfdfdfd16@hotmail.com 2015-08-06 00:14:00              NaT
6                 kdfdfdf979@outlook.com 2015-08-06 00:14:00              NaT
7           dd@ggggggggggg.eclipse.co.uk 2015-08-06 00:14:20              NaT
8                   gggz45@hotmail.co.uk 2015-08-06 00:14:43              NaT
9               gggggggggi@hotmail.co.uk 2015-08-06 00:17:03              NaT
10             mggggggggyke1@hotmail.com 2015-08-06 00:17:58              NaT
...
22                   ffdddfddd@yahoo.com 2015-08-06 00:46:12  0 days 00:04:15
Blue Moon
  • 4,421
  • 20
  • 52
  • 91
  • It looks all you're doing is grouping on email addresses then calculating `diff` on those groups and then assigning a count if the diff is larger than `10` correct? Also if this is `pandas` then tag it so – EdChum Nov 09 '15 at 10:50
  • Can you post raw input data and code to reproduce your df to avoid any ambiguities – EdChum Nov 09 '15 at 10:56

1 Answers1

1

IIUC then you can do the following:

df['difference'] = df.groupby('email_address')['ts_placed'].diff()

df['repeat'] = df.groupby('email_address')['difference'].transform(lambda x: (x < 10).cumcount())
EdChum
  • 376,765
  • 198
  • 813
  • 562
  • Can you post your data, code and the error, I've edited it – EdChum Nov 09 '15 at 10:59
  • this is the error I get AttributeError: 'Series' object has no attribute 'cumcount'. I cannot post data due to confidentiality. However, the email_address column is a string while the ts_placed column is a datetime64[ns] both of them have no missing values. On the other hand, the column difference is a timedelta64[ns] and has several NaT values – Blue Moon Nov 09 '15 at 11:05
  • How about: `df.groupby('email_address')['difference'].transform(lambda x: x < 10).cumcount()` – EdChum Nov 09 '15 at 11:06
  • Can you not just anonymise the data and just post a small snippet, it's difficult to post answers and interpret your errors without data and code – EdChum Nov 09 '15 at 11:07
  • OK, can you try this: `df['repeat'] = df.groupby('email_address')['difference'].transform(lambda x: (x.dt.seconds < 10).sum())` – EdChum Nov 09 '15 at 11:27