1

I have a dataframe df as below:

Datetime                Value
2020-03-01 08:00:00      10
2020-03-01 10:00:00      12
2020-03-01 12:00:00      15
2020-03-02 09:00:00       1
2020-03-02 10:00:00       3
2020-03-02 13:00:00       8
2020-03-03 10:00:00      20
2020-03-03 12:00:00      25
2020-03-03 14:00:00      15

I would like to calculate the difference between the value on the first time of each date and the last time of each date (ignoring the value of other time within a date), so the result will be:

Datetime      Value_Difference
2020-03-01          5
2020-03-02          7
2020-03-03         -5

I have been doing this using a for loop, but it is slow (as expected) when I have larger data. Any help will be appreciated.

Shaido
  • 27,497
  • 23
  • 70
  • 73
teteh May
  • 455
  • 2
  • 11
  • what is your type object of your dataframe ? A list ? – Thibault Cimic Aug 17 '20 at 09:23
  • it is pandas dataframe with column Datetime as datetime64[ns] and column Value and Value_Difference as float64 – teteh May Aug 17 '20 at 09:28
  • Ok so it is a dictionnary : https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html – Thibault Cimic Aug 17 '20 at 09:33
  • When you say large data, is it a lot of values for the same day ? Or a lot of days with still pretty much the same number of value per day ? Or both a lot of days with a lot of values per day ? And could you print typeof(df("2020-03-01 08:00:00)) for example ? – Thibault Cimic Aug 17 '20 at 09:34

2 Answers2

2

One solution would be to make sure the data is sorted by time, group by the data and then take the first and last value in each day. This works since pandas will preserve the order during groupby, see e.g. here.

df = df.sort_values(by='Datetime').groupby(df['Datetime'].dt.date).agg({'Value': ['first', 'last']})
df['Value_Difference'] = df['Value']['last'] - df['Value']['first']
df = df.drop('Value', axis=1).reset_index()

Result:

Datetime      Value_Difference
2020-03-01          5
2020-03-02          7
2020-03-03         -5
Shaido
  • 27,497
  • 23
  • 70
  • 73
0

Shaido's method works, but might be slow due to the groupby on very large sets

Another possible way is to take a difference from dates converted to int and only grab the values necessary without a loop.

idx = df.index

loc = np.diff(idx.strftime('%Y%m%d').astype(int).values).nonzero()[0]

loc1 = np.append(0,loc)

loc2 = np.append(loc,len(idx)-1)

res = df.values[loc2]-df.values[loc1]

df = pd.DataFrame(index=idx.date[loc1],values=res,columns=['values'])
EddyG
  • 675
  • 4
  • 13