3

I have a data frame in pandas that is organized like so:

btc_price['btc_price'] = pd.to_numeric(btc_price['btc_price'].str.replace(',', ''))
btc_price.head(n=120)
Out[4]: 
                      btc_price
time                           
2017-08-27 22:50:00  4,389.6113
2017-08-27 22:51:00  4,389.0850
2017-08-27 22:52:00  4,388.8625
2017-08-27 22:53:00  4,389.7888
2017-08-27 22:56:00  4,389.9138
2017-08-27 22:57:00  4,390.1663
2017-08-27 22:58:00  4,390.2600
2017-08-27 22:59:00  4,392.4013
2017-08-27 23:00:00  4,391.6588
2017-08-27 23:01:00  4,391.9213
2017-08-27 23:02:00  4,394.0113
2017-08-27 23:03:00  4,396.9713
2017-08-27 23:04:00  4,397.3350
2017-08-27 23:05:00  4,397.0700
2017-08-27 23:06:00  4,398.6188
2017-08-27 23:07:00  4,398.5725
2017-08-27 23:08:00  4,397.4713
2017-08-27 23:09:00  4,398.0938
2017-08-27 23:10:00  4,398.7775
2017-08-27 23:11:00  4,398.0200
2017-08-27 23:12:00  4,397.9513
2017-08-27 23:13:00  4,398.0613
2017-08-27 23:14:00  4,398.0900
2017-08-27 23:15:00  4,398.0063
2017-08-27 23:16:00  4,397.6088
2017-08-27 23:17:00  4,394.3763
2017-08-27 23:46:00  4,389.1100
2017-08-27 23:48:00  4,390.6763
2017-08-27 23:49:00  4,392.5388
2017-08-27 23:49:00  4,392.5388
                        ...
2017-08-28 00:51:00  4,367.5738
2017-08-28 00:51:00  4,367.5738
2017-08-28 00:52:00  4,367.7888
2017-08-28 00:53:00  4,368.4188
2017-08-28 00:54:00  4,368.8225
2017-08-28 00:55:00  4,368.7438
2017-08-28 00:57:00  4,368.4700
2017-08-28 00:58:00  4,367.9963
2017-08-28 00:59:00  4,366.4750
2017-08-28 01:00:00  4,359.1988
2017-08-28 01:01:00  4,355.2825
2017-08-28 01:02:00  4,352.3675
2017-08-28 01:03:00  4,354.2188
2017-08-28 01:04:00  4,353.5263
2017-08-28 01:05:00  4,354.2488
2017-08-28 01:06:00  4,358.8063
2017-08-28 01:07:00  4,359.5738
2017-08-28 01:08:00  4,361.7313
2017-08-28 01:09:00  4,360.8638
2017-08-28 01:10:00  4,363.0750
2017-08-28 01:11:00  4,362.3375
2017-08-28 01:12:00  4,362.3338
2017-08-28 01:13:00  4,358.8000
2017-08-28 01:14:00  4,354.0463
2017-08-28 01:15:00  4,356.1950
2017-08-28 01:16:00  4,359.5975
2017-08-28 01:17:00  4,360.1588
2017-08-28 01:18:00  4,362.2338
2017-08-28 01:19:00  4,363.7900
2017-08-28 01:20:00  4,362.6150

I would like to create a new column change that has a value of -1,0,1. These should correspond to a 5% decrease in price over the past hour (-1), "no change" (0), and a 5% increase in price over the past hour(1). Also, the value of one hour should be mutable so I could change it to one day or 30 minutes for example, as I see fit.

I found similar questions here and here but I am new to python and am not really sure how to apply these solutions specifically to my problem.

The other option is to calculate the avg price for each hour and then calculate the % change on an hourly basis, but I would prefer to be able to use a rolling timeframe.

I have also tried doing this in R with no luck. Please help.

I started by trying:

btc_price['change'] = btc_price.pct_change(periods=60, fill_method='pad', limit=None, freq=None) 

This works but does not quite give me what I am looking for, I would like to compare each value to the min and max values for the past "time frame" and calculate the % change based on this value instead of simply comparing two rows.

What I want to end up with is something like this (incomplete):

# Calculate the % change in btc_price
def calc_change(df):
    array = df.values
    a = array[:,1]

# Apply % change comparison to timeframe 
def rolling(df, period, func, min_periods = None):
        if min_periods is None:
            min_periods = period
        result = pd.Series(np.nan, index = df.index)

        for i in range(1, len(df) + 1):
            sub_df = df.iloc[max(i)]

I believe I could use something like df.rolling() found here but am not quite sure if this is exactly what I want because I dont quite understand how it works. An explanation would be great.

zsad512
  • 861
  • 3
  • 15
  • 41

1 Answers1

11
  1. You can paste DF in more friendly format, so that people can import it using pd.read_clipboard
  2. You should remove ',' from the btc_price and cast it using pd.to_numeric.

After having valid data, you can do:

In [59]: df.head()
Out[59]: 
                     btc_price
time                          
2017-09-07 22:50:00  4389.6113
2017-09-07 22:51:00  4389.0850
2017-09-07 22:52:00  4388.8625
2017-09-07 22:53:00  4389.7888
2017-09-07 22:56:00  4389.9138

In [60]: df = df.resample('1MIN').ffill(); df.head(10)
Out[60]: 
                     btc_price
time                          
2017-09-07 22:50:00  4389.6113
2017-09-07 22:51:00  4389.0850
2017-09-07 22:52:00  4388.8625
2017-09-07 22:53:00  4389.7888
2017-09-07 22:54:00  4389.7888
2017-09-07 22:55:00  4389.7888
2017-09-07 22:56:00  4389.9138
2017-09-07 22:57:00  4390.1663
2017-09-07 22:58:00  4390.2600
2017-09-07 22:59:00  4392.4013

In [61]: WINDOW = 5  # 5 minutes, you can change to any window you want. Has to match resolution from resample

In [63]: df['change'] = df['btc_price'].pct_change(periods=WINDOW); df.head(10)
Out[63]: 
                     btc_price    change
time                                    
2017-09-07 22:50:00  4389.6113       NaN
2017-09-07 22:51:00  4389.0850       NaN
2017-09-07 22:52:00  4388.8625       NaN
2017-09-07 22:53:00  4389.7888       NaN
2017-09-07 22:54:00  4389.7888       NaN
2017-09-07 22:55:00  4389.7888  0.000040
2017-09-07 22:56:00  4389.9138  0.000189
2017-09-07 22:57:00  4390.1663  0.000297
2017-09-07 22:58:00  4390.2600  0.000107
2017-09-07 22:59:00  4392.4013  0.000595

In [64]: import numpy as np

]n [67]: df['change_label'] = pd.cut(df['change'], [np.NINF, -0.05, 0.05, np.PINF], labels=['below 5%', 'around 0%', 'above 5%']) 

In [69]: df.head(10)
Out[69]: 
                     btc_price        change change_label
time                                                     
2017-09-07 22:50:00  4389.6113           NaN          NaN
2017-09-07 22:51:00  4389.0850           NaN          NaN
2017-09-07 22:52:00  4388.8625           NaN          NaN
2017-09-07 22:53:00  4389.7888           NaN          NaN
2017-09-07 22:54:00  4389.7888           NaN          NaN
2017-09-07 22:55:00  4389.7888  4.043638e-05    around 0%
2017-09-07 22:56:00  4389.9138  1.888321e-04    around 0%
2017-09-07 22:57:00  4390.1663  2.970701e-04    around 0%
2017-09-07 22:58:00  4390.2600  1.073400e-04    around 0%
2017-09-07 22:59:00  4392.4013  5.951311e-04    around 0%

Feels that you need to:

  1. Resample in order to get predictable resolution
  2. FFill in order to not have holes. Or handle that in other way that makes sense in your case.
  3. Use pct_change.
  4. For turning change into label, you can use pd.cut. Also, simple df['change'].map(lamba v: # here logic) would work.
grechut
  • 2,897
  • 1
  • 19
  • 18
  • 2
    My only question here is how is the % change being calculated? For example, is it looking only at the value from 5 minutes earlier (5 rows) and calculating the % change based on the difference for the current row? How can I adjust it to look at all values within the `WINDOW` and calculate based on the min/max? – zsad512 Sep 07 '17 at 15:51
  • `pct_change` calculates change from n periods (n rows) earlier. https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.pct_change.html So connected with shifting. – grechut Sep 07 '17 at 18:39
  • For min/max in the window, consider using rolling and min / max / apply. Depending on your needs. https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rolling.html – grechut Sep 07 '17 at 18:40
  • In [2]: s = pd.Series([2, 2, 2, 2, 1, 4, 2, 4, 8, 4, 1]) – grechut Sep 07 '17 at 18:45
  • In [3]: s.rolling(3).min() Out[3]: 0 NaN 1 NaN 2 2.0 3 2.0 4 1.0 5 1.0 6 1.0 7 2.0 8 2.0 9 4.0 10 1.0 dtype: float64 – grechut Sep 07 '17 at 18:45
  • In [5]: s.rolling(3).max() / s.rolling(3).min() Out[5]: 0 NaN 1 NaN 2 1.0 3 1.0 4 2.0 5 4.0 6 4.0 7 2.0 8 4.0 9 2.0 10 8.0 dtype: float64 – grechut Sep 07 '17 at 18:46
  • In [6]: s.rolling(3).apply(lambda df: (df.max() - df.min()) / df.min()) Out[6]: 0 NaN 1 NaN 2 0.0 3 0.0 4 1.0 5 3.0 6 3.0 7 1.0 8 3.0 9 1.0 10 7.0 dtype: float64 – grechut Sep 07 '17 at 18:46
  • 1
    Not sure about your exact needs :) `apply` gives you DataFrame with all data points within window. So then it's just up to you what you'll do with them – grechut Sep 07 '17 at 18:46
  • `s.rolling(3).apply(lambda df: (df.max() - df.min()) / df.min())` what if instead of (max-min)/min, I wanted two functions: ((max-current)/current) OR ((current-min)/current) does that make sense? – zsad512 Sep 07 '17 at 18:54
  • (window_max - current) / current can be achieved via: In [9]: (s.rolling(3).max() - s) / s Out[9]: 0 NaN 1 NaN 2 0.0 3 0.0 4 1.0 5 0.0 6 1.0 7 0.0 8 0.0 9 1.0 10 7.0 dtype: float64 – grechut Sep 07 '17 at 19:08
  • Note that you can also specify `min_periods` argument of rolling to lower values, so that you do not have to have all points available within window. – grechut Sep 07 '17 at 19:08