1

Are there any vector operations for improving runtime?

I found no other way besides for loops.

Sample DataFrame:

df = pd.DataFrame({'ID': ['1', '1','1','2','2','2'], 
                       'start_date': ['01-Jan', '05-Jan', '08-Jan', '05-Jan','06-Jan', '10-Jan'],
                       'start_value': [12, 15, 1, 3, 2, 6],
                       'end_value': [20, 17, 6,19,13.5,9]})

  ID start_date  start_value  end_value
0  1     01-Jan           12       20.0
1  1     05-Jan           15       17.0
2  1     08-Jan            1        6.0
3  2     05-Jan            3       19.0
4  2     06-Jan            2       13.5
5  2     10-Jan            6        9.0

I've tried:

import pandas as pd

df_original # contains data


data_frame_diff= pd.DataFrame()

for ID in df_original ['ID'].unique():
    tmp_frame = df_original .loc[df_original ['ID']==ID]
    tmp_start_value = 0

    for label, row in tmp_frame.iterrows():
        last_delta = tmp_start_value - row['value']
        tmp_start_value = row['end_value']

        row['last_delta'] = last_delta
        data_frame_diff= data_frame_diff.append(row,True)

Expected Result:

df = pd.DataFrame({'ID': ['1', '1','1','2','2','2'], 
                   'start_date': ['01-Jan', '05-Jan', '08-Jan', '05-Jan', '06-Jan', 
                                 '10-Jan'],
                   'last_delta': [0, 5, 16, 0, 17, 7.5]})

  ID start_date  last_delta
0  1     01-Jan         0.0
1  1     05-Jan         5.0
2  1     08-Jan        16.0
3  2     05-Jan         0.0
4  2     06-Jan        17.0
5  2     10-Jan         7.5

I want to calculate the delta between start_value and end_value of the timestamp and the following timestamp after for each user ID.

Is there a way to improve runtime of this code?

Shubham Sharma
  • 68,127
  • 6
  • 24
  • 53
m_ennk
  • 35
  • 5
  • 1
    Welcome to StackOverflow. Please include a small sample of your dataframe along with your desired results. Take a look at [`how-to-make-good-reproducible-pandas-examples`](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples). – Shubham Sharma Jul 13 '20 at 06:49
  • Thank you for your comment. I did a small example. – m_ennk Jul 13 '20 at 07:07

2 Answers2

1

Use DataFrame.groupby on ID and shift the column end_value then use Series.sub to subtract it from start_value, finally use Series.fillna and assign this new column s to the dataframe using DataFrame.assign:

s = df.groupby('ID')['end_value'].shift().sub(df['start_value']).fillna(0)
df1 = df[['ID', 'start_date']].assign(last_delta=s)

Result:

print(df1)

  ID start_date  last_delta
0  1     01-Jan         0.0
1  1     05-Jan         5.0
2  1     08-Jan        16.0
3  2     05-Jan         0.0
4  2     06-Jan        17.0
5  2     10-Jan         7.5
Shubham Sharma
  • 68,127
  • 6
  • 24
  • 53
0

It's a bit difficult to follow from your description what you need, but you might find this helpful:

import pandas as pd

df = (pd.DataFrame({'t1': pd.date_range(start="2020-01-01", end="2020-01-02", freq="H"),
                   })
       .reset_index().rename(columns={'index': 'ID'})
      )

df['t2'] = df['t1']+pd.Timedelta(value=10, unit="H")                 
df['delta_t1_t2'] = df['t2']-df['t1']
df['delta_to_previous_t1'] = df['t1'] - df['t1'].shift()

print(df)

It results in

    ID                  t1                  t2 delta_t1_t2 delta_to_previous_t1
0    0 2020-01-01 00:00:00 2020-01-01 10:00:00    10:00:00                  NaT
1    1 2020-01-01 01:00:00 2020-01-01 11:00:00    10:00:00             01:00:00
2    2 2020-01-01 02:00:00 2020-01-01 12:00:00    10:00:00             01:00:00
3    3 2020-01-01 03:00:00 2020-01-01 13:00:00    10:00:00             01:00:00
divingTobi
  • 2,044
  • 10
  • 25