1

I have the problem that the code from below is extremely slow. I haven't been working with Python and Pandas that long, so I don't know exactly where to start.

I want to determine the predecessor and successor of each row.

Currently I iterate over each row and output the rows that meet my conditions. From these series I determine the maximum and the minimum once.

I have the following record:

index   Case    Button      Start                       rowNow
0       x       a           2017-12-06 10:17:43.227     0
1       x       b           2017-12-06 10:17:44.876     1
2       x       c           2017-12-06 10:17:45.719     2
3       y       a           2017-12-06 15:28:57.500     3
4       y       e           2017-12-06 15:29:19.079     4

And I want to get it:

index   Case    Button      Start                       rowNow  prevNum nextNum
0       x       a           2017-12-06 10:17:43.227     0       NaN     1
1       x       b           2017-12-06 10:17:44.876     1       0       2
2       x       c           2017-12-06 10:17:45.719     2       1       NaN
3       y       a           2017-12-06 15:28:57.500     3       NaN     4
4       y       e           2017-12-06 15:29:19.079     4       3       NaN

Could someone give me some tips on how to optimize the speed of this code? Can vectorization be used here at all?

for index, row in df.iterrows():

    x = df[(df['Case'] == row['Case']) & (df['rowNow'] < row['rowNow']) & (row['Start'] >= df['Start'])]
    df.loc[index,'prevNum'] = x['rowNow'].max()
    y = df[(df['Case'] == row['Case']) & (df['rowNow'] > row['rowNow']) & (row['Start'] <= df['Start'])]    
    df.loc[index,'nextNum'] = y['rowNow'].min()
KoYan
  • 25
  • 5
  • https://stackoverflow.com/questions/24870953/does-iterrows-have-performance-issues – Rahul Agarwal Sep 11 '18 at 08:55
  • https://stackoverflow.com/questions/51149735/faster-alternative-to-iterrows – Rahul Agarwal Sep 11 '18 at 08:56
  • @RahulAgarwal I don't get how the second link is relevant, and the purpose of the OP's question is to find a vectorized approach re: the first link – roganjosh Sep 11 '18 at 08:58
  • hint: one approach might be to use [`groupby`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html) in combination with [`shift`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.shift.html) – roganjosh Sep 11 '18 at 09:00
  • the 2nd link I pasted because the questions asks for faster alternative for iterrows...It is just if the OP can get a idea...I have not marked as duplicate etc. – Rahul Agarwal Sep 11 '18 at 09:01
  • @RahulAgarwal I think it serves to confuse the issue tbh since there's literally thousands of examples that could be used to illustrate that, so linking that specific question appears to suggest the problem is relevant, and it isn't – roganjosh Sep 11 '18 at 09:02

2 Answers2

1

Try:

df['Start']=pd.to_datetime(df['Start'])
df['prevNum']=df['rowNow'].shift()
df['nextNum']=df['rowNow'].shift(-1)
df.loc[df['Start'].dt.hour!=df['Start'].shift().dt.hour,'prevNum']=pd.np.nan
df.loc[df['Start'].dt.hour!=df['Start'].shift(-1).dt.hour,'nextNum']=pd.np.nan
print(df)

If column start is not a datetime format, do:

df['Start']=pd.to_datetime(df['Start'])

before everything

Output:

  index Case      Button                   Start  rowNow  prevNum  nextNum
0     x    a  2017-12-06 2018-09-11 10:17:43.227       0      NaN      1.0
1     x    b  2017-12-06 2018-09-11 10:17:44.876       1      0.0      2.0
2     x    c  2017-12-06 2018-09-11 10:17:45.719       2      1.0      NaN
3     y    a  2017-12-06 2018-09-11 15:28:57.500       3      NaN      4.0
4     y    e  2017-12-06 2018-09-11 15:29:19.079       4      3.0      NaN
U13-Forward
  • 69,221
  • 14
  • 89
  • 114
  • In my case, @giser_yugang solution also works. But I think that this solution is responsive to the conditions. So I have another question, how would I proceed if I had another column with an end time? So: `x = df[(df['Case'] == row['Case']) & (df['rowNow'] < row['rowNow']) & (row['Start'] >= df['Complete'])]`and `y = df[(df['Case'] == row['Case']) & (df['rowNow'] > row['rowNow']) & (row['Complete'] <= df['Start'])]` – KoYan Sep 11 '18 at 09:26
  • @KoYan Can you Please show me the whole data containing the end time column and please explain what you want, anyway, happy that i helped, – U13-Forward Sep 11 '18 at 23:10
1

Try it:

df['prevNum'] = df.groupby('Case').apply(lambda x:x[['rowNow']].shift(1))
df['nextNum'] = df.groupby('Case').apply(lambda x:x[['rowNow']].shift(-1))
giser_yugang
  • 6,058
  • 4
  • 21
  • 44