Given,
# importing pandas as pd
import pandas as pd
import numpy as np
# Create sample dataframe
raw_data = {'ID': ['A1', 'B1', 'C1', 'D1'],
'Domain': ['Finance', 'IT', 'IT', 'Finance'],
'Target': [1, 2, 3, '0.9%'],
'Criteria':['<=', '<=', '>=', '>='],
"1/01":[0.9, 1.1, 2.1, 1],
"1/02":[0.4, 0.3, 0.5, 0.9],
"1/03":[1, 1, 4, 1.1],
"1/04":[0.7, 0.7, 0.1, 0.7],
"1/05":[0.7, 0.7, 0.1, 1],
"1/06":[0.9, 1.1, 2.1, 0.6],}
df = pd.DataFrame(raw_data, columns = ['ID', 'Domain', 'Target','Criteria', '1/01',
'1/02','1/03', '1/04','1/05', '1/06'])
It is easier to tackle this problem by breaking it into two parts (absolute thresholds and relative thresholds) and going through it step by step on the underlying numpy arrays.
EDIT: Long explanation ahead; skip to the end for just the final function.
First, create a list of date columns to access only the relevant columns in every row.
date_columns = ['1/01', '1/02','1/03', '1/04','1/05', '1/06']
df[date_columns].values
#Output:
array([[0.9, 0.4, 1. , 0.7, 0.7, 0.9],
[1.1, 0.3, 1. , 0.7, 0.7, 1.1],
[2.1, 0.5, 4. , 0.1, 0.1, 2.1],
[1. , 0.9, 1.1, 0.7, 1. , 0.6]])
Then we can use np.diff to get the differences between consecutive dates on the underlying array. We also take the absolute value, because only the size of each change matters here, not its direction.
np.abs(np.diff(df[date_columns].values))
#Output:
array([[0.5, 0.6, 0.3, 0. , 0.2],
[0.8, 0.7, 0.3, 0. , 0.4],
[1.6, 3.5, 3.9, 0. , 2. ],
[0.1, 0.2, 0.4, 0.3, 0.4]])
Looking only at the absolute threshold for now, we just check whether each difference is greater than a limit.
abs_threshold = 0.5
np.abs(np.diff(df[date_columns].values)) > abs_threshold
#Output:
array([[False, True, False, False, False],
[ True, True, False, False, False],
[ True, True, True, False, True],
[False, False, False, False, False]])
Summing this array along each row gives the result we need: a sum over a boolean array treats True as 1 and False as 0, so it effectively counts how many True values are present. For percentage thresholds, we need one additional step: dividing each difference by the original value before the comparison.
To elaborate:
The sum along each row gives the count of values crossing the absolute threshold, as follows.
abs_fluctuations = np.abs(np.diff(df[date_columns].values)) > abs_threshold
print(abs_fluctuations.sum(-1))
#Output:
[1 2 4 0]
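As a small aside on `.sum(-1)`: the `-1` means "the last axis", which for a 2D array runs across the columns of each row. A minimal sketch, assuming the boolean mask printed above is typed in by hand (the names `same_with_axis_1` and `same_with_count` are only illustrative):

```python
import numpy as np

# Boolean mask copied from the absolute-threshold comparison above.
abs_fluctuations = np.array([
    [False, True, False, False, False],
    [True, True, False, False, False],
    [True, True, True, False, True],
    [False, False, False, False, False],
])

# Summing over the last axis counts the True values in each row.
row_counts = abs_fluctuations.sum(-1)

# For a 2D array, axis=1 is the same axis as -1.
same_with_axis_1 = abs_fluctuations.sum(axis=1)

# np.count_nonzero expresses the same counting intent explicitly.
same_with_count = np.count_nonzero(abs_fluctuations, axis=1)

print(row_counts)  # [1 2 4 0]
```

Note that the comparison uses a strict `>`, so the first difference in the first row (exactly 0.5) is not counted.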
To start with relative thresholds, we create the differences array the same way as before.
dates = df[date_columns].values #same as before, but just assigned
differences = np.abs(np.diff(dates)) #same as before, just assigned
pct_threshold=0.5 #aka 50%
print(differences.shape) #(4, 5) aka 4 rows, 5 columns if you want to think traditional tabular 2D shapes only
print(dates.shape) #(4, 6) 4 rows, 6 columns
Note that the differences array has one column fewer than the dates array. That makes sense: for 6 dates there are 5 "differences", one for each gap.
Now, just focusing on 1 row, we see that calculating percent changes is simple.
print(dates[0][:2]) #for first row[0], take the first two dates[:2]
#Output:
[0.9 0.4]
print(differences[0][0]) #for first row[0], take the first difference[0]
#Output:
0.5
A change from 0.9 to 0.4 is a change of 0.5 in absolute terms. In percentage terms, it is a change of 0.5/0.9, i.e. (difference/original) * 100, which is about 55.56%, or 0.5556 as a fraction (I have omitted the multiplication by 100 to keep things simple).
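The arithmetic for this single pair can be checked directly. This is only a minimal sketch; the variable names `original`, `new` and `relative_change` are chosen here for illustration:

```python
# First two values of the first row in the sample data.
original, new = 0.9, 0.4

# Size of the change, ignoring its direction.
difference = abs(new - original)

# Relative change: difference divided by the value we started from.
relative_change = difference / original

print(round(relative_change, 4))  # 0.5556
```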
The main thing to realise at this step is that we need to do this division against the "original" values for all differences to get percent changes.
However, the dates array has one "column" too many, so we take a simple slice.
dates[:,:-1] #For all rows(:,), take all columns except the last one(:-1).
#Output:
array([[0.9, 0.4, 1. , 0.7, 0.7],
[1.1, 0.3, 1. , 0.7, 0.7],
[2.1, 0.5, 4. , 0.1, 0.1],
[1. , 0.9, 1.1, 0.7, 1. ]])
Now I can calculate the relative (percentage) changes by element-wise division.
relative_differences = differences / dates[:,:-1]
Then it is the same thing as before: pick a threshold and see if it is crossed.
rel_fluctuations = relative_differences > pct_threshold
#Output:
array([[ True, True, False, False, False],
[ True, True, False, False, True],
[ True, True, True, False, True],
[False, False, False, False, False]])
Now, if we want to count a value when either the absolute or the relative threshold is crossed, we just take a bitwise OR, | (it's even there in the sentence!), of the two boolean arrays and then take the sum along rows.
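As a sketch of that combination step on its own, assuming the sample values are typed in as a plain array and both thresholds are 0.5 (the name `either` is only illustrative):

```python
import numpy as np

# Date values from the sample dataframe, one row per ID.
dates = np.array([
    [0.9, 0.4, 1.0, 0.7, 0.7, 0.9],
    [1.1, 0.3, 1.0, 0.7, 0.7, 1.1],
    [2.1, 0.5, 4.0, 0.1, 0.1, 2.1],
    [1.0, 0.9, 1.1, 0.7, 1.0, 0.6],
])

differences = np.abs(np.diff(dates))

abs_fluctuations = differences > 0.5                  # absolute threshold
rel_fluctuations = differences / dates[:, :-1] > 0.5  # relative threshold

# Element-wise OR: True where at least one of the two thresholds is crossed.
either = abs_fluctuations | rel_fluctuations

print(either.sum(-1))  # [2 3 4 0]
```

For boolean arrays, `np.logical_or(abs_fluctuations, rel_fluctuations)` gives the same result as `|`.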
Putting all this together, we can create a function that is ready to use. Functions are nothing special, just a way of grouping lines of code together for ease of use. Using one is as simple as calling it; you have already been using functions and methods all along, perhaps without realising it.
date_columns = ['1/01', '1/02','1/03', '1/04','1/05', '1/06'] #if hardcoded.
date_columns = df.columns[4:] #if you wish to assign dynamically: the dates start at the 5th column, which is index 4.
def get_FCount(df, date_columns, abs_threshold=0.5, pct_threshold=0.5):
    '''Expects a list of date columns with at least two values.
    Returns a 1D array with the FCount for every row.
    pct_threshold: percentage expressed as a fraction, where 1 means 100%
    '''
    dates = df[date_columns].values
    differences = np.abs(np.diff(dates))
    abs_fluctuations = differences > abs_threshold
    rel_fluctuations = differences / dates[:,:-1] > pct_threshold
    #bitwise OR, since we are concerned with values that cross even one of the thresholds
    return (abs_fluctuations | rel_fluctuations).sum(-1)
df['FCount'] = get_FCount(df, date_columns) #call our function, and assign the result array to a new column
print(df['FCount'])
#Output:
0 2
1 3
2 4
3 0
Name: FCount, dtype: int32
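One caveat: the sample data has no zeros, but if a date value can be 0, the division for the relative change produces inf or nan along with a runtime warning. The following is only a sketch of one way to guard against that. `get_fcount_zero_safe` is a hypothetical name, and treating a zero "original" value as "relative threshold not crossed" is an assumption about the desired behaviour, not something the question specifies.

```python
import numpy as np
import pandas as pd

def get_fcount_zero_safe(df, date_columns, abs_threshold=0.5, pct_threshold=0.5):
    """Hypothetical variant of get_FCount that tolerates zeros in the data."""
    dates = df[date_columns].to_numpy(dtype=float)
    differences = np.abs(np.diff(dates))
    base = dates[:, :-1]
    # Divide only where the original value is nonzero; all other cells keep
    # the 0.0 from `out`, so they never cross the relative threshold
    # (assumed policy, see the note above).
    relative = np.divide(differences, base,
                         out=np.zeros_like(differences), where=base != 0)
    return ((differences > abs_threshold) | (relative > pct_threshold)).sum(-1)

# Tiny made-up frame with a zero in the first row, for illustration only.
sample = pd.DataFrame({"1/01": [0.0, 1.0], "1/02": [0.2, 1.0], "1/03": [0.2, 2.0]})
counts = get_fcount_zero_safe(sample, ["1/01", "1/02", "1/03"])
print(counts)  # [0 1]
```

The absolute threshold still applies to cells whose original value is 0, so large jumps from zero are counted through that check.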