
I have a dataframe containing a table of financial timeseries, with each row having the columns:

  • ID of that timeseries
  • a Target value (against which we want to measure deviations, both relative and absolute)
  • and a timeseries of values for various dates: 1/01, 1/02, 1/03, ...

We want to calculate the fluctuation counts, both relative and absolute, for every row/ID's timeseries. Then we want to find which row/ID has the most fluctuations/'spikes', as follows:

  • First, we find the difference between two consecutive timeseries values and choose a threshold. The threshold represents how much difference is allowed between two values before we declare a 'fluctuation' or 'spike'. If the difference between any two consecutive columns' values exceeds the threshold, it counts as a spike (see the small sketch after this list).
    • However, the threshold needs to be generic: it must work with both percentage and absolute values, between any two values in any row.
    • So we pick a threshold in percentage form (an educated guess), since one row's values are expressed in '%' form, and a percentage threshold also works with absolute values.
  • The output should be a new column of fluctuation counts (FCount), both relative and absolute, for every row/ID.
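
For example (an illustrative sketch only; the 0.5 threshold here is a placeholder, not the final choice):

# row A1 moves from 0.9 on 1/01 to 0.4 on 1/02
old, new = 0.9, 0.4
rel_change = abs(new - old) / abs(old)  # ~0.556, i.e. ~55.6%
is_spike = rel_change > 0.5             # True, so this step counts as one fluctuation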

Code:

import pandas as pd 

# Create sample dataframe
raw_data = {'ID': ['A1', 'B1', 'C1', 'D1'], 
  'Domain': ['Finance', 'IT', 'IT', 'Finance'], 
  'Target': [1, 2, 3, '0.9%'],   # note: '0.9%' must be a string, otherwise this is a syntax error
  'Criteria':['<=', '<=', '>=', '>='],
  "1/01":[0.9, 1.1, 2.1, 1],
  "1/02":[0.4, 0.3, 0.5, 0.9], 
  "1/03":[1, 1, 4, 1.1], 
  "1/04":[0.7, 0.7, 0.1, 0.7],
  "1/05":[0.7, 0.7, 0.1, 1], 
  "1/06":[0.9, 1.1, 2.1, 0.6],}

df = pd.DataFrame(raw_data, columns = ['ID', 'Domain', 'Target','Criteria', '1/01', 
  '1/02','1/03', '1/04','1/05', '1/06'])

   ID   Domain  Target Criteria  1/01  1/02  1/03  1/04  1/05  1/06  
0  A1  Finance       1       <=   0.9   0.4   1.0   0.7   0.7   0.9  
1  B1       IT       2       <=   1.1   0.3   1.0   0.7   0.7   1.1  
2  C1       IT       3       >=   2.1   0.5   4.0   0.1   0.1   2.1  
3  D1  Finance     0.9%      >=   1.0   0.9   1.1   0.7   1.0   0.6

And here's the expected output, with a fluctuation count (FCount) column. From it we can get whichever ID has the largest FCount.

   ID   Domain  Target Criteria  1/01  1/02  1/03  1/04  1/05  1/06  FCount
0  A1  Finance       1       <=   0.9   0.4   1.0   0.7   0.7   0.9    -
1  B1       IT       2       <=   1.1   0.3   1.0   0.7   0.7   1.1    - 
2  C1       IT       3       >=   2.1   0.5   4.0   0.1   0.1   2.1    - 
3  D1  Finance     0.9%      >=   1.0   0.9   1.1   0.7   1.0   0.6    -
    what do you mean by *fluctuating the most*? largest range, largest std? – Quang Hoang Jul 04 '19 at 01:17
  • you should give the code that print this table, This way can better understand and see how to get you an accepted answer. – Nicolas Racine Jul 04 '19 at 01:21
  • comparing the date values to its target. So, basically finding IDs who's date values are not stable to its target. – krijan Jul 04 '19 at 01:21
  • @NicolasRacine Please see the edits – krijan Jul 04 '19 at 01:23
  • Well, you may compare values at each date and calculate how much it diverges from the target. But, what is your criteria when you say "fluctuating the most" as @QuangHoang said? If it is standard deviation, then just substract target from date values and get the std of resulting dataset for each row. Also if this is your data design, it is terribly denormalized. – Sıddık Açıl Jul 05 '19 at 02:15
  • Correct, I have to compare values at each date and calculate how much it diverges from the target. This is why I want to find spikes, so I can visually see how much difference is present between each values compare to target. Hence, Whichever ID has the highest spikes difference will be "fluctuating the most". – krijan Jul 05 '19 at 02:40
  • However, i don't know how to execute that, this is why i asked for assistance. – krijan Jul 05 '19 at 02:41
  • Please check the edit i made. I had chat with my boss on what exactly he is after. – krijan Jul 09 '19 at 01:01
  • Welcome to SO, this is a great first question. However you need to clearly state your questions and define your terms; please edit the question. *"Finding fluctuations"* is very vague, you need to define precisely what it means and show a clear example (in your case I think you meant *"relative and absolute deviations from each row's Target value for the timeseries"*). Definitely don't say something like "find spikes in the date columns". ... – smci Jul 11 '19 at 02:26
  • ... And don't just say *"in this table"* then expect every reader to spend unnecessary time decoding what's going on in your pretty dense dataframe format where each row represents a financial timeseries; with the columns `ID`, a `Target` value (against which we want to measure relative and absolute deviations), and a timeseries of values for various dates: `1/01, 1/02, 1/03, ...` In short, **make it super-easy for the reader to instantly understand what your data represents, and explain concisely you want to do to it** – smci Jul 11 '19 at 02:28
  • ...*"the most fluctuating date values"* is meaningless. You mean *"timeseries values"*, so say that. Specifically **"the row-counts of timeseries values which exceed a relative or absolute deviation, respectively"**. Being precise is hard but you gotta put in the effort, otherwise this question will not be a reusable resource for anyone, and cannot really be understood without also viewing the data and running the code. – smci Jul 11 '19 at 02:33
  • Really it is probably cleaner to keep two separate dataframes: a ***`metadata`*** dataframe for the columns `'ID', 'Domain', 'Target','Criteria'`, and a ***`values`*** dataframe for the columns `'1/01', '1/02','1/03', '1/04','1/05', '1/06'`. You then **set the (shared) pandas index on both dataframes to be `ID`**, now it's easy to handle them separately yet be able to do a pandas merge/join on results, since they both have `ID` as their index. That's so much cleaner in code than constantly slicing columns and column-indices. – smci Jul 11 '19 at 03:17
  • As @AndrasDeak said, noone used 'Criteria'... OP please edit the question to define it. But it's probably more simply defined by using a dict to map the string '>='/'<=' to either an integer +1/-1 column, or to the operators [pandas.Series.gt](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.gt.html), [pandas.Series.lt](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.lt.html)... although for new users probably more clear to do `if-else` based on `'Criteria' == '>='/'<='` – smci Jul 11 '19 at 05:32

4 Answers


Given,

# importing pandas as pd 
import pandas as pd
import numpy as np

# Create sample dataframe
raw_data = {'ID': ['A1', 'B1', 'C1', 'D1'], 
'Domain': ['Finance', 'IT', 'IT', 'Finance'], 
'Target': [1, 2, 3, '0.9%'],
'Criteria':['<=', '<=', '>=', '>='],
"1/01":[0.9, 1.1, 2.1, 1],
"1/02":[0.4, 0.3, 0.5, 0.9], 
"1/03":[1, 1, 4, 1.1], 
"1/04":[0.7, 0.7, 0.1, 0.7],
"1/05":[0.7, 0.7, 0.1, 1], 
"1/06":[0.9, 1.1, 2.1, 0.6],}



df = pd.DataFrame(raw_data, columns = ['ID', 'Domain', 'Target','Criteria', '1/01', 
'1/02','1/03', '1/04','1/05', '1/06'])

It is easier to tackle this problem by breaking it into two parts (absolute thresholds and relative thresholds) and going through it step by step on the underlying numpy arrays.


EDIT: Long explanation ahead, skip to the end for just the final function

First, create a list of date columns to access only the relevant columns in every row.

date_columns = ['1/01', '1/02','1/03', '1/04','1/05', '1/06']
df[date_columns].values
#Output:
array([[0.9, 0.4, 1. , 0.7, 0.7, 0.9],
       [1.1, 0.3, 1. , 0.7, 0.7, 1.1],
       [2.1, 0.5, 4. , 0.1, 0.1, 2.1],
       [1. , 0.9, 1.1, 0.7, 1. , 0.6]])

Then we can use np.diff to get the differences between consecutive dates on the underlying array. We also take the absolute value, because that is what we are interested in.

np.abs(np.diff(df[date_columns].values))
#Output:
array([[0.5, 0.6, 0.3, 0. , 0.2],
       [0.8, 0.7, 0.3, 0. , 0.4],
       [1.6, 3.5, 3.9, 0. , 2. ],
       [0.1, 0.2, 0.4, 0.3, 0.4]])

Now, worrying only about the absolute thresholds, it is as simple as checking whether the differences are greater than a limit.

abs_threshold = 0.5
np.abs(np.diff(df[date_columns].values)) > abs_threshold
#Output:
array([[False,  True, False, False, False],
       [ True,  True, False, False, False],
       [ True,  True,  True, False,  True],
       [False, False, False, False, False]])

We can see that the sum over this array for every row gives us the result we need (summing a boolean array uses the underlying True=1 and False=0, so you are effectively counting how many True values are present). For percentage thresholds, we just need one additional step: dividing all differences by the original values before the comparison.

To elaborate:

We can see how the sum along each row gives us the counts of values crossing the absolute threshold, as follows.

abs_fluctuations = np.abs(np.diff(df[date_columns].values)) > abs_threshold
print(abs_fluctuations.sum(-1))
#Output:
[1 2 4 0]

To start on relative thresholds, we can create the differences array the same way as before.

dates = df[date_columns].values #same as before, but just assigned
differences = np.abs(np.diff(dates)) #same as before, just assigned
pct_threshold=0.5 #aka 50%
print(differences.shape) #(4, 5) aka 4 rows, 5 columns if you want to think traditional tabular 2D shapes only
print(dates.shape) #(4, 6) 4 rows, 6 columns

Note that the differences array has one fewer column, which makes sense: for 6 dates there are 5 "differences", one for each gap.

Now, just focusing on 1 row, we see that calculating percent changes is simple.

print(dates[0][:2]) #for first row[0], take the first two dates[:2]
#Output:
array([0.9, 0.4])
print(differences[0][0]) #for first row[0], take the first difference[0]
#Output:
0.5

A change from 0.9 to 0.4 is a change of 0.5 in absolute terms, but in percentage terms it is a change of 0.5/0.9 (difference/original), i.e. roughly 0.5556, or 55.56% if multiplied by 100 (I have omitted the multiplication by 100 to keep things simpler).

The main thing to realise at this step is that we need to do this division against the "original" values for all differences to get percent changes. However, the dates array has one "column" too many, so we take a simple slice.

dates[:,:-1] #For all rows(:,), take all columns except the last one(:-1).
#Output:
array([[0.9, 0.4, 1. , 0.7, 0.7],
       [1.1, 0.3, 1. , 0.7, 0.7],
       [2.1, 0.5, 4. , 0.1, 0.1],
       [1. , 0.9, 1.1, 0.7, 1. ]])

Now, we can calculate relative (percentage) changes by element-wise division:

relative_differences = differences / dates[:,:-1]

Then, the same as before: pick a threshold and see whether it's crossed.

rel_fluctuations = relative_differences > pct_threshold
#Output:
array([[ True,  True, False, False, False],
       [ True,  True, False, False,  True],
       [ True,  True,  True, False,  True],
       [False, False, False, False, False]])

Now, if we want to consider whether either the absolute or the relative threshold is crossed, we just need to take a bitwise OR | (the "or" is even there in the sentence!) and then take the sum along the rows.
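
As a quick sanity check (added for clarity), we can OR the two masks already built above and count per row; the result matches the final function's output below:

print((abs_fluctuations | rel_fluctuations).sum(-1))
#Output:
[2 3 4 0]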

Putting all this together, we can create a ready-to-use function. Note that functions are nothing special, just a way of grouping lines of code together for ease of use. Using a function is as simple as calling it; you have been using functions/methods all along without realising it.


date_columns = ['1/01', '1/02','1/03', '1/04','1/05', '1/06'] #if hardcoded.
date_columns = df.columns[4:] #if you wish to assign dynamically; here the date columns start at column index 4.

def get_FCount(df, date_columns, abs_threshold=0.5, pct_threshold=0.5):
    '''Expects a list of date columns with at least two values.
        returns a 1D array, with FCounts for every row.
        pct_threshold: percentage, where 1 means 100%
    '''
    dates = df[date_columns].values
    differences = np.abs(np.diff(dates))
    abs_fluctuations = differences > abs_threshold
    rel_fluctuations = differences / dates[:,:-1] > pct_threshold
    return (abs_fluctuations | rel_fluctuations).sum(-1) #we took a bitwise OR. since we are concerned with values that cross even one of the thresholds.

df['FCount'] = get_FCount(df, date_columns) #call our function, and assign the result array to a new column
print(df['FCount'])
#Output:
0    2
1    3
2    4
3    0
Name: FCount, dtype: int32
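
The question also asks which ID fluctuates the most; a short follow-up (not part of the original answer) using the new column:

print(df.loc[df['FCount'].idxmax(), 'ID'])
#Output:
C1
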
Paritosh Singh
  • Hello thank you for your answer. I clearly understood what you did for the absolute threshold, so Im really happy about that. However for the percentage threshold.I was just confused about, what you meant by dividing the difference with original value. Furthermore, the very last function you made get_FCount. I didn't understood that, is it possible to use, something non function related. Or maybe i was told that if its possible to have a generic threshold like a percentage threshold that can work with both absolute and percentage then it would be better. – krijan Jul 09 '19 at 04:49
  • no worries. i will admit i got a bit lazy while writing the explanation by the end of it all, i'll add what i did for relative thresholds too. As for the function, this function is working with both absolute and relative thresholds, and takes both into account. Is it the function call that is putting you off or something with how it's being used that's unclear? Anyways, adding the explanation for relative thresholds too. – Paritosh Singh Jul 09 '19 at 04:52
  • Its just how its used thats unclear. Furthermore, ill be honest i have lots and lots of date columns, rather than inputting every date value like you did "date_columns = ['1/01', '1/02','1/03', '1/04','1/05', '1/06']". Is it any way to just filter the dates out? we are speaking 100s of date columns. sorry for hassling. – krijan Jul 09 '19 at 04:55
  • The question makes sense now after reading this answer. Just to add: to filter date cols you can try with `date_cols=pd.to_datetime(df.columns,format='%d/%m',errors='coerce').notna()` and then `df.loc[:,date_cols]` considering the format of all the cols are same. – anky Jul 09 '19 at 05:03
  • @anky_91 See the problem here is the format of all the cols are not same. The date columns headers are written in strange manner. So if there is any iloc approach to doing what you done. It will be better. – krijan Jul 09 '19 at 05:18
  • @krijan alright i added the explanations, but it seems like there's a simpler function for % changes, see anky's answer. now, you will need to have some method to select the correct columns. Think programming: there has to be a way to figure out which columns are dates by code, right? :) – Paritosh Singh Jul 09 '19 at 05:21
  • Correct i agree thats what separates the best programmer from worst. Your answer is much generic as its a function that can work for any df. I agree your work is indeed great aswell as anky. You both are amazing at this. Really appreciate your help and support – krijan Jul 09 '19 at 05:31
  • Ill execute this tomorrow as my working hours has ended. If i come across any issue, ill come back here. :) – krijan Jul 09 '19 at 05:36
  • *"The question makes sense now after reading this answer."* is worrying antipattern, it means the question is indecipherable without this, and hence not reusable resource. @ParitoshSingh I appreciate you put tons of work into this excellent answer (which I upvoted), but you and OP need to also edit the OP's question for basic clarity and brevity - please try to define 'fluctuations (both absolute and relative)' in the first three lines. – smci Jul 11 '19 at 02:04
  • I did major edits on the question; also see my answer which improves on yours by using separate dataframes with one shared `ID` index which gives much cleaner code. – smci Jul 11 '19 at 05:15
  • @ParitoshSingh I showed the response to my senior. He agreed with me that it is correct. However he asked me to modify my approach towards getting the solution. I rewrote the steps that he has asked me to follow this time he and I we are both sure that those steps should be the right way to get the solution. I wanted to request you if you could please give me assistance towards solving it. chat.stackoverflow.com/transcript/message/46743818#46743818 – krijan Jul 12 '19 at 02:03

Assuming you want percent changes across all columns in a row against a threshold, you can also try pct_change() on axis=1:

thresh_=0.5
s=pd.to_datetime(df.columns,format='%d/%m',errors='coerce').notna() #all date cols
df=df.assign(Count=df.loc[:,s].pct_change(axis=1).abs().gt(thresh_).sum(axis=1))

Or:

df.assign(Count=df.iloc[:,4:].pct_change(axis=1).abs().gt(0.5).sum(axis=1))

   ID   Domain  Target Criteria  1/01  1/02  1/03  1/04  1/05  1/06  Count
0  A1  Finance     1.0       <=   0.9   0.4   1.0   0.7   0.7   0.9      2
1  B1       IT     2.0       <=   1.1   0.3   1.0   0.7   0.7   1.1      3
2  C1       IT     3.0       >=   2.1   0.5   4.0   0.1   0.1   2.1      4
3  D1  Finance     0.9       >=   1.0   0.9   1.1   0.7   1.0   0.6      0
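
Note that pct_change() covers only the relative threshold. If you also want the absolute check from the question, here is a sketch combining the two (not part of this answer; the 0.5 thresholds are assumed, as in the accepted answer):

date_cols = df.columns[4:]                            # date columns start at position 4 here
vals = df[date_cols]
rel_spikes = vals.pct_change(axis=1).abs().gt(0.5)    # relative fluctuations
abs_spikes = vals.diff(axis=1).abs().gt(0.5)          # absolute fluctuations
df['FCount'] = (rel_spikes | abs_spikes).sum(axis=1)  # per-row count crossing either threshold
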
anky
  • nice! i didn't know about `pct_change()` +1. – Paritosh Singh Jul 09 '19 at 05:20
  • @ParitoshSingh Your answer made the question clearer to me. upvoted. :) – anky Jul 09 '19 at 05:21
  • Ahh thank you very much, so basically. That iloc[:,4:] approach will target all the dates columns i assume. No need to extract the date columns using pd.to_datetime?. – krijan Jul 09 '19 at 05:22
  • @anky_91, I showed the response to my senior. He agreed with me that it is correct. However he asked me to modify my approach towards getting the solution. I rewrote the steps that he has asked me to follow this time he and I we are both sure that those steps should be the right way to get the solution. I wanted to request you if you could please give me assistance towards solving it. https://chat.stackoverflow.com/transcript/message/46743818#46743818 – krijan Jul 12 '19 at 02:01

Try a loc and an iloc and a sub and an abs and a sum and an idxmin:

print(df.loc[df.iloc[:, 4:].sub(df['Target'].tolist(), axis='rows').abs().sum(1).idxmin(), 'ID'])

Output:

D1

Explanation:

  • I first get the columns starting from index 4 (the date columns), then simply subtract the corresponding Target from each row.

  • Then get the absolute value of it, so -1.1 will be 1.1 and 1.1 will still be 1.1, then sum each row and get the row with the lowest total.

  • Then use a loc to look up that index in the actual dataframe and take its ID column, which gives you D1 (the chain is unpacked step by step in the sketch below).
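
The same chain, unpacked into steps (a sketch; it assumes Target holds numeric values, so the '0.9%' string in the sample data would need converting first):

date_vals = df.iloc[:, 4:]                          # the date columns
deviations = date_vals.sub(df['Target'], axis=0)    # subtract each row's Target from its values
total_dev = deviations.abs().sum(axis=1)            # total absolute deviation per row
print(df.loc[total_dev.idxmin(), 'ID'])             # smallest total deviation -> 'D1'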

U13-Forward
  • This answer doesn't say anything about the criteria in the dataframe (which are not mentioned in the question but were discussed in chat with your presence), and it contains a completely needless and unreadable `df.apply` call. Just do proper vectorized arithmetic on the dataframe. It will be faster _and_ more readable. But before you do that, wait until the asker clarifies their question. – Andras Deak -- Слава Україні Jul 08 '19 at 01:35
  • Please check the Edit. I believe i had a chat with my boss and i was told what i need to extract. ill just clarify it, Target and criteria column is not related for this. We only need to work with the dates columns. – krijan Jul 08 '19 at 05:33
  • Sorry, this is kind of nasty idiom. `df.loc[df.iloc[...` is a code smell that the wrong choice of index has been made, and `.tolist()` should always be avoidable in good pandas idiom. Also, `.sub(df['Target'].tolist(), axis='rows')` is a code smell that `Target` series should have been subtracted from the entire dataframe `values` (with broadcasting). – smci Jul 11 '19 at 05:43

The following is cleaner pandas idiom and improves on @ParitoshSingh's version, by keeping two separate dataframes:

  • a ts (metadata) dataframe for the metadata columns 'ID', 'Domain', 'Target', 'Criteria'
  • a values dataframe for the timeseries values (or 'dates' as the OP keeps calling them)
  • use ID as the common index for both dataframes; you then get seamless merge/join on them and on any results, such as the output of compute_FCount_df() (a join sketch is shown further down)
  • now there's no need to pass around ugly lists of column names or indices (into compute_FCount_df()). This is much better deduplication, as mentioned in the comments. The setup code is at the bottom.

Doing this makes compute_FCount_df reduce to a four-liner (I improved @ParitoshSingh's version to use the pandas built-ins df.diff(axis=1) and .abs(); also note that the resulting series is returned with the correct ID index, not 0:3, hence it can be used directly in assignment/insertion/merge/join):

def compute_FCount_df(dat, abs_threshold=0.5, pct_threshold=0.5):
    """Compute FluctuationCount for all timeseries/rows"""
    differences = dat.diff(axis=1).iloc[:, 1:].abs()
    abs_fluctuations = differences > abs_threshold
    # divide by the previous ("original") values; .values avoids pandas label alignment
    rel_fluctuations = differences / dat.iloc[:, :-1].values > pct_threshold
    return (abs_fluctuations | rel_fluctuations).sum(1)

where the boilerplate to set up two separate dataframes is at bottom.

Also note it's cleaner not to put the fcounts series/column in either values (where it definitely doesn't belong) or ts (where it would be kind of kludgy). Keep it as its own series:

#(rather than assigning ts['FCount'] = ...)
fcounts = compute_FCount_df(values)

>>> fcounts
A1    2
B1    3
C1    4
D1    0

and this allows you to directly get the index (ID) of the timeseries with most 'fluctuations':

>>> fcounts.idxmax()
'C1'
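
And because fcounts shares the ID index with ts, joining it back on (should you want it as a column) is a one-liner; a small sketch of the merge/join mentioned above:

ts_with_counts = ts.join(fcounts.rename('FCount'))  # index-aligned join on the shared ID index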

But really, since conceptually we're applying the function separately row-wise to each row of timeseries values, we should use values.apply(..., axis=1):

values.apply(compute_FCount_ts, axis=1)

def compute_FCount_ts(dat, abs_threshold=0.5, pct_threshold=0.5):
    """Compute FluctuationCount for a single timeseries (row)"""
    differences = dat.diff().iloc[1:].abs()
    abs_fluctuations = differences > abs_threshold
    # dat is a single row (Series) here, so use 1-D indexing; .values avoids label alignment
    rel_fluctuations = differences / dat.iloc[:-1].values > pct_threshold
    return (abs_fluctuations | rel_fluctuations).sum()

(With the per-row indexing above, this returns the same FCount series as compute_FCount_df.)

Last, here's the boilerplate code to set up two separate dataframes, with shared index ID:

import pandas as pd
import numpy as np

ts = pd.DataFrame(index=['A1', 'B1', 'C1', 'D1'], data={
    'Domain': ['Finance', 'IT', 'IT', 'Finance'],
    'Target': [1, 2, 3, '0.9%'],
    'Criteria':['<=', '<=', '>=', '>=']})

values = pd.DataFrame(index=['A1', 'B1', 'C1', 'D1'], data={
    "1/01":[0.9, 1.1, 2.1, 1],
    "1/02":[0.4, 0.3, 0.5, 0.9],
    "1/03":[1, 1, 4, 1.1],
    "1/04":[0.7, 0.7, 0.1, 0.7],
    "1/05":[0.7, 0.7, 0.1, 1],
    "1/06":[0.9, 1.1, 2.1, 0.6]})
smci