I have a dataframe, produced by this piece of code:

import numpy as np
import pandas as pd
import datetime as dt

datarange=pd.date_range('01-05-2018 00:00:00', periods=50, freq="4H")
range_series_1=pd.Series(np.random.randint(-5,3,size=50).astype(float), index=datarange)
range_series_2=pd.Series(np.random.randint(5,9,size=50).astype(float), index=datarange)
frame=pd.DataFrame({'value1':range_series_1, 'value2':range_series_2})
frame.index.name='datetime'

I need two extra columns (e.g. 'column 1' and 'column 2') that hold the following calculations:

column 1 = value1[every day @ 4:00] - value1[same day @ 12:00] --> the value should be placed at 00:00 of the day

column 2 = value1[every day @ 4:00] - value2[same day @ 12:00] --> the value should be placed at 00:00 of the day

The rest of the rows in column 1 and column 2 should be either NaN or 0.

Is there an idiomatic pandas way to do this, please?

ADDITIONAL INFORMATION to clarify: my real dataframe may not have datetimes spaced as regularly as the ones shown here, so a row-based shift() may not be a general solution.

An example of the desired output is the following (with comments to be clearer):

                     value1  value2  column1   column2
datetime                           
2018-01-05 00:00:00    -3.0     8.0      2.0      -8.0  #2.0=-3.0-(-5.0) and -8.0=-3.0-(5.0)   
2018-01-05 04:00:00    -3.0     8.0      NaN       NaN
2018-01-05 08:00:00     1.0     8.0      NaN       NaN
2018-01-05 12:00:00    -5.0     5.0      NaN       NaN
2018-01-05 16:00:00     0.0     5.0      NaN       NaN
2018-01-05 20:00:00    -4.0     8.0      NaN       NaN
2018-01-06 00:00:00    -5.0     8.0     -1.0     -11.0  #-1.0=-5.0-(-4.0) and -11.0=-5.0-(6.0)
2018-01-06 04:00:00    -5.0     5.0      NaN       NaN
2018-01-06 08:00:00    -2.0     5.0      NaN       NaN
2018-01-06 12:00:00    -4.0     6.0      NaN       NaN
2018-01-06 16:00:00    -1.0     6.0      NaN       NaN
2018-01-06 20:00:00     1.0     7.0      NaN       NaN
2018-01-07 00:00:00    -1.0     5.0      4.0      -4.0  #4.0=2.0-(-2.0) and -4.0=2.0-(6.0)
2018-01-07 04:00:00     2.0     8.0      NaN       NaN
2018-01-07 08:00:00    -1.0     8.0      NaN       NaN
2018-01-07 12:00:00    -2.0     6.0      NaN       NaN
2018-01-07 16:00:00     2.0     6.0      NaN       NaN
2018-01-07 20:00:00    -4.0     6.0      NaN       NaN

I am trying to figure out whether the answer to "Update a dataframe in pandas while iterating row by row" can help me, but I am not yet very familiar with iterating over rows.

Newbielp
  • Can you add expected output of some rows? – jezrael Oct 25 '19 at 08:05
  • @jezrael, I just updated my question and hopefully it is clearer now. It is puzzling me a lot... Thanks in advance for your willingness to help. – Newbielp Oct 25 '19 at 08:25
  • My solution would be very long: create one df_1 with times only 04:00 and another one df_2 with time only 12:00, then `concatenate()` the two and `sort_values` by datetime. Then I would do the calculation using `shift(1)` and would `drop` the rows where time is 12:00. I would then apply `timedelta` 4 hours, to move 04:00 to 00:00 and then, `merge` the resulting df to the initial df. But this is a workaround way too complicated and I believe there is better logic and code than what I have thought. – Newbielp Oct 25 '19 at 08:34

2 Answers

IIUC, you want to shift by time:

shifted = frame.shift(-8, freq='H')

frame['col1'] = frame['value1'] - shifted['value1']
frame['col2'] = frame['value1'] - shifted['value2']

Output (head):

                     value1  value2  col1  col2
datetime                                       
2018-01-05 00:00:00    -3.0     8.0  -2.0  -8.0
2018-01-05 04:00:00     0.0     7.0  -2.0  -5.0
2018-01-05 08:00:00    -1.0     5.0  -1.0  -8.0
2018-01-05 12:00:00     2.0     5.0   3.0  -5.0
2018-01-05 16:00:00     0.0     7.0   4.0  -6.0
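
As a hedged variant of the shift-by-time idea, assuming the goal is the one stated in the question (value at 04:00 minus value at 12:00, written at 00:00 of the same day), one could shift twice and then blank out every row that is not midnight. The seed, the `default_rng` generator and the names `at_0400`, `at_1200` and `is_midnight` are illustrative choices, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Same shape of data as in the question; the seed is an arbitrary assumption.
rng = np.random.default_rng(0)
idx = pd.date_range("2018-01-05", periods=50, freq=pd.Timedelta(hours=4))
frame = pd.DataFrame(
    {
        "value1": rng.integers(-5, 3, size=50).astype(float),
        "value2": rng.integers(5, 9, size=50).astype(float),
    },
    index=idx,
)
frame.index.name = "datetime"

# Relabel the rows so that the 04:00 (resp. 12:00) value sits on 00:00 of the same day.
at_0400 = frame.shift(-4, freq=pd.Timedelta(hours=1))
at_1200 = frame.shift(-12, freq=pd.Timedelta(hours=1))

# Subtraction aligns on the index; labels missing on either side give NaN.
col1 = at_0400["value1"] - at_1200["value1"]
col2 = at_0400["value1"] - at_1200["value2"]

# Keep the result only on rows whose timestamp is exactly midnight.
is_midnight = frame.index == frame.index.normalize()
frame["col1"] = col1.reindex(frame.index).where(is_midnight)
frame["col2"] = col2.reindex(frame.index).where(is_midnight)

print(frame.head(8))
```

Because the shift moves index labels rather than rows, this does not depend on the rows being evenly spaced; a day without a 04:00 or 12:00 row simply ends up with NaN.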
Quang Hoang
  • It does not output the expected result, unfortunately. I need NaN at times other than 00:00 and I need to control it via the index. I am still trying to figure out whether anything like `rolling()` would do the job or whether it should be applied at all. – Newbielp Oct 25 '19 at 06:42

Use:

import numpy as np
import pandas as pd
from datetime import time

np.random.seed(2019)

datarange=pd.date_range('01-05-2018 00:00:00', periods=50, freq="4H")
range_series_1=pd.Series(np.random.randint(-5,3,size=50).astype(float), index=datarange)
range_series_2=pd.Series(np.random.randint(5,9,size=50).astype(float), index=datarange)
frame=pd.DataFrame({'value1':range_series_1, 'value2':range_series_2})
frame.index.name='datetime'
#print (frame)

The idea is to select rows by comparing the times in the index, then use DatetimeIndex.floor to strip the time component (leaving the default 00:00:00) from the index of both the Series and the DataFrame:

s = frame.loc[frame.index.time == time(4, 0), 'value1']
s.index = s.index.floor('D')
print (s)
datetime
2018-01-05   -3.0
2018-01-06   -5.0
2018-01-07   -5.0
2018-01-08   -5.0
2018-01-09   -1.0
2018-01-10   -4.0
2018-01-11   -2.0
2018-01-12    0.0
2018-01-13    1.0
Name: value1, dtype: float64

df1 = frame.loc[frame.index.time == time(12, 0), ['value1', 'value2']]
df1.index = df1.index.floor('D')
print (df1)
            value1  value2
datetime                  
2018-01-05     0.0     6.0
2018-01-06     2.0     8.0
2018-01-07    -5.0     7.0
2018-01-08     2.0     7.0
2018-01-09    -1.0     5.0
2018-01-10     1.0     7.0
2018-01-11     2.0     7.0
2018-01-12    -2.0     6.0

Then subtract from the right side with DataFrame.rsub (which computes s - df1 column by column), add a prefix to the new column names, and join the result to the original:

frame = frame.join(df1.rsub(s, axis=0).add_prefix('new_'))
print (frame.head(15))
                     value1  value2  new_value1  new_value2
datetime                                                   
2018-01-05 00:00:00    -5.0     6.0        -3.0        -9.0
2018-01-05 04:00:00    -3.0     5.0         NaN         NaN
2018-01-05 08:00:00     2.0     7.0         NaN         NaN
2018-01-05 12:00:00     0.0     6.0         NaN         NaN
2018-01-05 16:00:00    -5.0     7.0         NaN         NaN
2018-01-05 20:00:00     1.0     6.0         NaN         NaN
2018-01-06 00:00:00     1.0     5.0        -7.0       -13.0
2018-01-06 04:00:00    -5.0     8.0         NaN         NaN
2018-01-06 08:00:00     0.0     6.0         NaN         NaN
2018-01-06 12:00:00     2.0     8.0         NaN         NaN
2018-01-06 16:00:00    -1.0     8.0         NaN         NaN
2018-01-06 20:00:00    -3.0     8.0         NaN         NaN
2018-01-07 00:00:00    -5.0     5.0         0.0       -12.0
2018-01-07 04:00:00    -5.0     8.0         NaN         NaN
2018-01-07 08:00:00     2.0     5.0         NaN         NaN
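
To make the `rsub` step concrete, here is a small sketch that uses the first two days of `s` and `df1` printed above. It only illustrates that `df1.rsub(s, axis=0)` is the same as computing `s - df1[col]` for each column; the name `manual` is introduced here for the comparison.

```python
import pandas as pd

# First two days of the 04:00 Series and the 12:00 DataFrame shown above.
days = pd.to_datetime(["2018-01-05", "2018-01-06"])
s = pd.Series([-3.0, -5.0], index=days, name="value1")
df1 = pd.DataFrame({"value1": [0.0, 2.0], "value2": [6.0, 8.0]}, index=days)

# rsub reverses the operands: every column of df1 is subtracted from s,
# with rows matched by index label (axis=0).
diff = df1.rsub(s, axis=0)

# The same computation written out column by column.
manual = pd.DataFrame({col: s - df1[col] for col in df1.columns})

print(diff)
```

The two rows of `diff` are the values that appear at 2018-01-05 00:00 and 2018-01-06 00:00 in the joined frame.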
jezrael
  • I like your solution. It is clear and I am tempted to consider this as the accepted answer. However, I am contemplating a solution with `np.where` but I do not know if data can be accessed via something like `index+timedelta`. – Newbielp Oct 25 '19 at 11:34
  • @Newbielp - Not sure I understand. `np.where` is maybe possible, but I have no idea how... – jezrael Oct 25 '19 at 11:37
  • I am trying to figure out how to code the calculation... but for this to work, I should be able to access a row in a manner described by this pseudocode: frame.loc[[datetime+delta]]. It has intrigued me and I will try a bit more. If I find anything I will let you know. Otherwise, your solution is great. – Newbielp Oct 25 '19 at 12:07
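
The `frame.loc[[datetime+delta]]` lookup sketched in the comments can be approximated with `reindex`, which returns NaN for labels that are missing instead of raising. What follows is a minimal sketch, assuming a unique DatetimeIndex; `lookup` is a hypothetical helper written for this example, and the seed and generator are arbitrary.

```python
import numpy as np
import pandas as pd

# Same shape of data as in the question; the seed is an arbitrary assumption.
rng = np.random.default_rng(0)
idx = pd.date_range("2018-01-05", periods=50, freq=pd.Timedelta(hours=4))
frame = pd.DataFrame(
    {
        "value1": rng.integers(-5, 3, size=50).astype(float),
        "value2": rng.integers(5, 9, size=50).astype(float),
    },
    index=idx,
)
frame.index.name = "datetime"


def lookup(series, delta):
    """Hypothetical helper: value of `series` at (row timestamp + delta).

    Rows whose shifted timestamp is not in the index get NaN.
    """
    return series.reindex(series.index + delta).to_numpy()


# True for rows whose timestamp is exactly 00:00.
is_midnight = frame.index == frame.index.normalize()

v1_0400 = lookup(frame["value1"], pd.Timedelta(hours=4))
v1_1200 = lookup(frame["value1"], pd.Timedelta(hours=12))
v2_1200 = lookup(frame["value2"], pd.Timedelta(hours=12))

# np.where keeps the difference on midnight rows and NaN everywhere else.
frame["column1"] = np.where(is_midnight, v1_0400 - v1_1200, np.nan)
frame["column2"] = np.where(is_midnight, v1_0400 - v2_1200, np.nan)

print(frame.head(8))
```

Since the lookup is by label, irregular spacing is tolerated: a midnight row whose 04:00 or 12:00 partner is absent gets NaN.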