0

I have a data-frame which has columns like:

colA  colB   colC    colD  colE   flag
 A     X    2018Q1    500   600    1
 A     X    2018Q2    200   800    1 
 A     X    2018Q3    100   400    1
 A     X    2018Q4    500   600    1
 A     X    2019Q1    400   7000   0
 A     X    2019Q2    1500  6100   0
 A     X    2018Q3    5600  600    1
 A     X    2018Q4    500   6007   1

 A     Y    2016Q1    900   620    1
 A     Y    2016Q2    750   850    0
 A     Y    2017Q1    750   850    1
 A     Y    2017Q2    750   850    1
 A     Y    2017Q3    750   850    1
 A     Y    2018Q1    750   850    1
 A     Y    2018Q2    750   850    1
 A     Y    2018Q3    750   850    1
 A     Y    2018Q4    750   850    1

A row at colA, colB level passes a statistical check if at colA, colB level the value of flag==1 for continuous 4 quarters of data after sorting for one stride. We have to stride like this: 2018Q1-2018Q4 then 2018Q2-2019Q1 .... so on if there is 4 continuous quarter and flag==1 then we lable that as 1.

The final output will be like:

colA  colB   colC    colD  colE   flag  check_qtr
 A     X    2018Q1    500   600    1     1
 A     X    2018Q2    200   800    1     1
 A     X    2018Q3    100   400    1     1
 A     X    2018Q4    500   600    1     1
 A     X    2019Q1    400   7000   0     0   
 A     X    2019Q2    1500  6100   0     0
 A     X    2018Q3    5600  600    1     0
 A     X    2018Q4    500   6007   1     0

 A     Y    2016Q1    900   620    1     0
 A     Y    2016Q2    750   850    0     0
 A     Y    2017Q1    750   850    1     0
 A     Y    2017Q2    750   850    1     0
 A     Y    2017Q3    750   850    1     0
 A     Y    2018Q1    750   850    1     1
 A     Y    2018Q2    750   850    1     1   
 A     Y    2018Q3    750   850    1     1
 A     Y    2018Q4    750   850    1     1
  1. How can we do this using pandas and numpy?
  2. Can we implemet this is using sql?
MAC
  • 1,345
  • 2
  • 30
  • 60
  • Which database you are using ? – Digvijay S Mar 12 '20 at 07:47
  • Amazon's redshift – MAC Mar 12 '20 at 08:29
  • In your sample data 7th row have 2018Q2 i.e data is not sorted. Its this the case or sample data is like that ? – Digvijay S Mar 12 '20 at 09:43
  • 1
    if above is not the case then you can use below sql `select colA , colB, colC , colD, colE, flag, CASE WHEN flag = 1 AND lead (flag, 1) OVER(PARTITION BY COLA ORDER BY colC ASC) = 1 AND lead (flag, 2) OVER(PARTITION BY COLB ORDER BY colC ASC) = 1 AND lead (flag, 3) OVER(PARTITION BY COLB ORDER BY colC ASC) = 1 THEN 1 ELSE 0 END as check_qtr from tab order by colB, colC;` – Digvijay S Mar 12 '20 at 09:44
  • How to write this sql code in python? – MAC Mar 12 '20 at 11:07

1 Answers1

1

Concerning your first question, this can be done like this using pandas: First i'll generate your example dataframe:

import pandas as pd
df = pd.DataFrame({'colA':['A']*17,
                   'colB':['X']*8+['Y']*9,
                   'flag':[1,1,1,1,0,0,1,1,1,0,1,1,1,1,1,1,1]})
df.set_index(['colA','colB'], inplace=True) # Set index as multilevel with colA and colB

Resulting in your example dataframe. However, to use the following approach, we'll need to go back to a normal index:

df.reset_index(inplace=True)

  colA colB flag
0   A   X    1
1   A   X    1
2   A   X    1
3   A   X    1
4   A   X    0
5   A   X    0
6   A   X    1
7   A   X    1
8   A   Y    1
9   A   Y    0
10  A   Y    1
11  A   Y    1
12  A   Y    1
13  A   Y    1 
14  A   Y    1 
15  A   Y    1
16  A   Y    1

Then to obtain your result column you can use the groupby function (with some print to understand what's going on):

from scipy.ndimage.interpolation import shift
import numpy as np
df['check_qtr'] = pd.Series(0,index=df.index) # Initialise your result column 
for name, group in df.groupby(['colA','colB','flag']):
    if name[2] == 1:
        print(name)
        idx = ((group.index.values - shift(group.index.values, 1, cval=-1)) == 1).astype(int) # Is the index of the following value just 1 place after current ?
        print(idx)
        bools = [idx[x:x+4].sum()==4 for x in range(len(idx))] # Are the 4 next indexes following each others ?
        print(bools)
        for idx in group.index.values[bools]: # For each index where the 4 next indexes following each others
            df.loc[idx:idx+3,'check_qtr'] = 1 #set check_qtr in row idx to row idx+3

('A', 'X', 1)
[1 1 1 1 0 1]
[True, False, False, False, False, False]
('A', 'Y', 1)
[0 0 1 1 1 1 1 1]
[False, False, True, True, True, False, False, False]

Note that we are using +4 in the case where we are doing array indexing. Because array[x:x+4] will give you the 4 values at index x to x+3. We are using +3 when using loc because loc doesn't use the same logic. It retrieves indexes by name and not position. So between value idx and idx+3 we'll get 4 values.

Giving you the result you want:

       colA    colB    flag check_qtr
 0      A       X       1      1
 1      A       X       1      1
 2      A       X       1      1
 3      A       X       1      1
 4      A       X       0      0
 5      A       X       0      0
 6      A       X       1      0
 7      A       X       1      0
 8      A       Y       1      0
 9      A       Y       0      0
 10     A       Y       1      0
 11     A       Y       1      1
 12     A       Y       1      1
 13     A       Y       1      1
 14     A       Y       1      1
 15     A       Y       1      1
 16     A       Y       1      1

This may not be the perfect way to do it, but it can give you some hints about how you can use some of those functions !

Bruce Swain
  • 583
  • 3
  • 10
  • Can we execute this sql code `select colA , colB, colC , colD, colE, flag, CASE WHEN flag = 1 AND lead (flag, 1) OVER(PARTITION BY COLA ORDER BY colC ASC) = 1 AND lead (flag, 2) OVER(PARTITION BY COLB ORDER BY colC ASC) = 1 AND lead (flag, 3) OVER(PARTITION BY COLB ORDER BY colC ASC) = 1 THEN 1 ELSE 0 END as check_qtr from tab order by colB, colC;` in python3?? – MAC Mar 12 '20 at 11:50
  • I'm not really a SQL expert, but you got two choice i think.In both case starting from your SQL raw data, you either format it before extracting it with this SQL, or extract it raw from your DB and format it with pandas. – Bruce Swain Mar 12 '20 at 12:00
  • Sql makes code more shorter that's the reason I asked if we can run sql code in python with pandas data frame? – MAC Mar 12 '20 at 16:15
  • The results arent matching – MAC Mar 12 '20 at 16:17
  • This question may help you with your goal then : https://stackoverflow.com/questions/45865608/executing-an-sql-query-over-a-pandas-dataset And what part of the results are not matching ? – Bruce Swain Mar 13 '20 at 08:04