How to do time continuity checks using python in pandas data-frame

Question

I have a data-frame which has columns like:

colA  colB   colC    colD  colE   flag
 A     X    2018Q1    500   600    1
 A     X    2018Q2    200   800    1 
 A     X    2018Q3    100   400    1
 A     X    2018Q4    500   600    1
 A     X    2019Q1    400   7000   0
 A     X    2019Q2    1500  6100   0
 A     X    2018Q3    5600  600    1
 A     X    2018Q4    500   6007   1

 A     Y    2016Q1    900   620    1
 A     Y    2016Q2    750   850    0
 A     Y    2017Q1    750   850    1
 A     Y    2017Q2    750   850    1
 A     Y    2017Q3    750   850    1
 A     Y    2018Q1    750   850    1
 A     Y    2018Q2    750   850    1
 A     Y    2018Q3    750   850    1
 A     Y    2018Q4    750   850    1

A row at colA, colB level passes a statistical check if at colA, colB level the value of flag==1 for continuous 4 quarters of data after sorting for one stride. We have to stride like this: 2018Q1-2018Q4 then 2018Q2-2019Q1 .... so on if there is 4 continuous quarter and flag==1 then we lable that as 1.

The final output will be like:

colA  colB   colC    colD  colE   flag  check_qtr
 A     X    2018Q1    500   600    1     1
 A     X    2018Q2    200   800    1     1
 A     X    2018Q3    100   400    1     1
 A     X    2018Q4    500   600    1     1
 A     X    2019Q1    400   7000   0     0   
 A     X    2019Q2    1500  6100   0     0
 A     X    2018Q3    5600  600    1     0
 A     X    2018Q4    500   6007   1     0

 A     Y    2016Q1    900   620    1     0
 A     Y    2016Q2    750   850    0     0
 A     Y    2017Q1    750   850    1     0
 A     Y    2017Q2    750   850    1     0
 A     Y    2017Q3    750   850    1     0
 A     Y    2018Q1    750   850    1     1
 A     Y    2018Q2    750   850    1     1   
 A     Y    2018Q3    750   850    1     1
 A     Y    2018Q4    750   850    1     1

How can we do this using pandas and numpy?
Can we implemet this is using sql?

In your sample data 7th row have 2018Q2 i.e data is not sorted. Its this the case or sample data is like that ? — Digvijay S, Mar 12 '20 at 09:43
if above is not the case then you can use below sql `select colA , colB, colC , colD, colE, flag, CASE WHEN flag = 1 AND lead (flag, 1) OVER(PARTITION BY COLA ORDER BY colC ASC) = 1 AND lead (flag, 2) OVER(PARTITION BY COLB ORDER BY colC ASC) = 1 AND lead (flag, 3) OVER(PARTITION BY COLB ORDER BY colC ASC) = 1 THEN 1 ELSE 0 END as check_qtr from tab order by colB, colC;` — Digvijay S, Mar 12 '20 at 09:44

Bruce Swain · Answer 1 · 2020-03-12T09:54:15.890

Concerning your first question, this can be done like this using pandas: First i'll generate your example dataframe:

import pandas as pd
df = pd.DataFrame({'colA':['A']*17,
                   'colB':['X']*8+['Y']*9,
                   'flag':[1,1,1,1,0,0,1,1,1,0,1,1,1,1,1,1,1]})
df.set_index(['colA','colB'], inplace=True) # Set index as multilevel with colA and colB

Resulting in your example dataframe. However, to use the following approach, we'll need to go back to a normal index:

df.reset_index(inplace=True)

  colA colB flag
0   A   X    1
1   A   X    1
2   A   X    1
3   A   X    1
4   A   X    0
5   A   X    0
6   A   X    1
7   A   X    1
8   A   Y    1
9   A   Y    0
10  A   Y    1
11  A   Y    1
12  A   Y    1
13  A   Y    1 
14  A   Y    1 
15  A   Y    1
16  A   Y    1

Then to obtain your result column you can use the groupby function (with some print to understand what's going on):

from scipy.ndimage.interpolation import shift
import numpy as np
df['check_qtr'] = pd.Series(0,index=df.index) # Initialise your result column 
for name, group in df.groupby(['colA','colB','flag']):
    if name[2] == 1:
        print(name)
        idx = ((group.index.values - shift(group.index.values, 1, cval=-1)) == 1).astype(int) # Is the index of the following value just 1 place after current ?
        print(idx)
        bools = [idx[x:x+4].sum()==4 for x in range(len(idx))] # Are the 4 next indexes following each others ?
        print(bools)
        for idx in group.index.values[bools]: # For each index where the 4 next indexes following each others
            df.loc[idx:idx+3,'check_qtr'] = 1 #set check_qtr in row idx to row idx+3

('A', 'X', 1)
[1 1 1 1 0 1]
[True, False, False, False, False, False]
('A', 'Y', 1)
[0 0 1 1 1 1 1 1]
[False, False, True, True, True, False, False, False]

Note that we are using +4 in the case where we are doing array indexing. Because array[x:x+4] will give you the 4 values at index x to x+3. We are using +3 when using loc because loc doesn't use the same logic. It retrieves indexes by name and not position. So between value idx and idx+3 we'll get 4 values.

Giving you the result you want:

       colA    colB    flag check_qtr
 0      A       X       1      1
 1      A       X       1      1
 2      A       X       1      1
 3      A       X       1      1
 4      A       X       0      0
 5      A       X       0      0
 6      A       X       1      0
 7      A       X       1      0
 8      A       Y       1      0
 9      A       Y       0      0
 10     A       Y       1      0
 11     A       Y       1      1
 12     A       Y       1      1
 13     A       Y       1      1
 14     A       Y       1      1
 15     A       Y       1      1
 16     A       Y       1      1

This may not be the perfect way to do it, but it can give you some hints about how you can use some of those functions !

Can we execute this sql code `select colA , colB, colC , colD, colE, flag, CASE WHEN flag = 1 AND lead (flag, 1) OVER(PARTITION BY COLA ORDER BY colC ASC) = 1 AND lead (flag, 2) OVER(PARTITION BY COLB ORDER BY colC ASC) = 1 AND lead (flag, 3) OVER(PARTITION BY COLB ORDER BY colC ASC) = 1 THEN 1 ELSE 0 END as check_qtr from tab order by colB, colC;` in python3?? — MAC, Mar 12 '20 at 11:50
I'm not really a SQL expert, but you got two choice i think.In both case starting from your SQL raw data, you either format it before extracting it with this SQL, or extract it raw from your DB and format it with pandas. — Bruce Swain, Mar 12 '20 at 12:00
Sql makes code more shorter that's the reason I asked if we can run sql code in python with pandas data frame? — MAC, Mar 12 '20 at 16:15
This question may help you with your goal then : https://stackoverflow.com/questions/45865608/executing-an-sql-query-over-a-pandas-dataset And what part of the results are not matching ? — Bruce Swain, Mar 13 '20 at 08:04

How to do time continuity checks using python in pandas data-frame

1 Answers1