
Let the sample DataFrame (df1) be:

[image: sample data]

We can obtain df2 (the final DataFrame) by manipulating the data of df1 as follows:

Step 1: Remove all non-negative numbers, i.e. positives and zeros.

After Step 1, the sample data should look like this:

[image: sample data after Step 1]

Step 2: If the A row of column4 has a negative number and the B row is blank, then remove the negative number from the A row.

Step 3: If the A row of column4 is blank and the B row has a negative number, then keep the negative number in the B row.

[image: sample data after Steps 2 and 3]
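Steps 2 and 3 can be sketched on a hypothetical A/B pair by transposing so months become rows (the frame and values here are made up for illustration):

```python
import pandas as pd
import numpy as np

# hypothetical pair of rows (A above B) for one group, after Step 1
df = pd.DataFrame({'column4': ['A', 'B'],
                   'May-21': [-30.0, np.nan],
                   'Jun-21': [np.nan, -20.0]})
t = df.set_index('column4').T  # months as rows, A/B as columns
# Step 2: A negative while B blank -> drop A's value
t.loc[t['A'].lt(0) & t['B'].isna(), 'A'] = np.nan
# Step 3: A blank while B negative -> keep B (nothing to change)
df = t.T.reset_index()
```

After Step 1 every remaining value is negative, so Step 3 is effectively a no-op and only Step 2 mutates the frame.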

After Steps 1, 2 and 3 are done:

Step 4: If both the A and B values of column4 are negative, then:

For each A/B pair, check the left-hand-side (LHS) value, i.e. the previous month's value, in the same A and B rows of column4.

Step 4.1: If either of the LHS values (A or B) is a negative number, then delete the current B value and keep the current A value.

After Step 4.1, the sample data should look like this:

[image: sample data after Step 4.1]

Step 4.2:

If the LHS values of both A and B are blank, then keep the current B value and delete the current A value.

After Step 4.2, the sample data should look like this:

[image: sample data after Step 4.2]

Since two negative numbers still remain, we perform Step 4.1 again; the final DataFrame (df2) then looks like:

[image: sample data after Step 4.1]
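One pass of Step 4 can be sketched on a hypothetical months-as-rows layout (column names and values here are made up; in practice the pass is repeated, as described above, until no pair has both values negative):

```python
import pandas as pd
import numpy as np

# hypothetical months-as-rows layout for one group (columns A and B)
t = pd.DataFrame({'A': [np.nan, -30.0], 'B': [-10.0, -60.0]},
                 index=['Apr-21', 'May-21'])
both_neg = t['A'].lt(0) & t['B'].lt(0)   # months where both are negative
lhs = t.shift(1)                         # previous month's (LHS) values
# Step 4.1: any LHS value negative -> drop B, keep A
rule_41 = both_neg & (lhs['A'].lt(0) | lhs['B'].lt(0))
t.loc[rule_41, 'B'] = np.nan
# Step 4.2: both LHS values blank -> drop A, keep B
rule_42 = both_neg & lhs['A'].isna() & lhs['B'].isna()
t.loc[rule_42, 'A'] = np.nan
```

Here May-21 has both values negative and a negative LHS (Apr-21 B), so Step 4.1 clears B and keeps A.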

How can I achieve the above using pandas? I was able to get as far as Step 1 but have no idea how to proceed further. Any help would be greatly appreciated.

This is the approach that I took:

import pandas as pd

df = pd.read_excel('df1.xlsx', engine='openpyxl')
df.to_pickle('./df1.pkl')
unpickled_df = pd.read_pickle('./df1.pkl')

rem_cols = ['column2', 'column3', 'column5', 'column6', 'column7']
# disambiguate duplicate (column1, column4) pairs before indexing
unpickled_df['g'] = unpickled_df.groupby(['column1', 'column4']).cumcount()
df1 = unpickled_df.drop(rem_cols, axis=1)
df1 = df1.set_index(['column1', 'g', 'column4'])
df1.columns = pd.to_datetime(df1.columns, format='%b-%y').strftime('%b-%y')
first_date = df1.columns[0]
df1 = df1.unstack(-1)

# Step 1: blank out zeros and positive numbers
df1 = df1.mask(df1.ge(0))

# Steps 2/3: mask values whose paired A/B value is blank
m1 = (df1.xs('A', level=1, axis=1, drop_level=False).notna() &
      df1.xs('B', level=1, axis=1, drop_level=False).rename(columns={'B': 'A'}, level=1).isna())
m2 = (df1.xs('B', level=1, axis=1, drop_level=False).notna() &
      df1.xs('A', level=1, axis=1, drop_level=False).rename(columns={'A': 'B'}, level=1).isna())
m = m1.join(m2)
df1 = df1.mask(m)

# Step 4 attempt: df2 here is the one-month-left-shifted (LHS) copy of df1
df2 = df1.groupby(level=1, axis=1).shift(1, axis=1)
mask1 = df1.notna() & df2.isna() & (df1.columns.get_level_values(1) == 'A')[None, :]
mask1[first_date] = False
mask2 = df1.notna() & df2.notna() & (df1.columns.get_level_values(1) == 'B')[None, :]
df1 = df1.mask(mask1).mask(mask2).stack(dropna=False)

unpickled_df = unpickled_df[rem_cols + ['column1', 'g', 'column4']].join(df1, on=['column1', 'g', 'column4'])
#print(unpickled_df)

Small test data, df1:

{'column1': ['ABC', 'ABC', 'CDF', 'CDF'], 'column4': ['A', 'B', 'A', 'B'], 'Feb-21': [0, 10, 0, 0], 'Mar-21': [0, 0, 70, 70], 'Apr-21': [-10, -10, -8, 60], 'May-21': [-30, -60, -10, 40], 'Jun-21': [-20, 9, -40, -20], 'Jul-21': [30, -10, 0, -20], 'Aug-21': [-30, -20, 0, -20], 'Sep-21': [0, -15, 0, -20], 'Oct-21': [0, -15, 0, -20]}

df2 (expected output),

{'column1': ['ABC', 'ABC', 'CDF', 'CDF'], 'column4': ['A', 'B', 'A', 'B'], 'Feb-21': [nan, nan, nan, nan], 'Mar-21': [nan, nan, nan, nan], 'Apr-21': [nan, -10.0, nan, nan], 'May-21': [-30.0, nan, nan, nan], 'Jun-21': [nan, nan, nan, -20.0], 'Jul-21': [nan, -10.0, nan, -20.0], 'Aug-21': [-30.0, nan, nan, -20.0], 'Sep-21': [nan, -15.0, nan, -20.0], 'Oct-21': [nan, -15.0, nan, -20.0]}
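For quick verification, the pasted dicts can be loaded into DataFrames (note they are written with a bare `nan`, so bind it first), and any candidate solution compared against the expected output with `pd.testing.assert_frame_equal`. A sketch using a small excerpt of the data (the `solve` name is hypothetical):

```python
import numpy as np
import pandas as pd

nan = np.nan  # the dicts above use a bare `nan`

# excerpts of df1 and the expected df2 (two months only, for brevity)
df1 = pd.DataFrame({'column1': ['ABC', 'ABC'], 'column4': ['A', 'B'],
                    'Apr-21': [-10, -10], 'May-21': [-30, -60]})
expected = pd.DataFrame({'column1': ['ABC', 'ABC'], 'column4': ['A', 'B'],
                         'Apr-21': [nan, -10.0], 'May-21': [-30.0, nan]})

# once a transformation `solve` exists, check it with:
# pd.testing.assert_frame_equal(solve(df1), expected)
```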

Test data:

df1

{'column1': ['CT', 'CT', 'NBB', 'NBB', 'CT', 'CT', 'NBB', 'NBB', 'HHH', 'HHH', 'TP1', 'TP1', 'TPR', 'TPR', 'PP1', 'PP1', 'PP1', 'PP1'], 'column2': ['POUPOU', 'POUPOU', 'PRPRP', 'PRPRP', 'STDD', 'STDD', 'STDD', 'STDD', 'STEVT', 'STEVT', 'SYSYS', 'SYSYS', 'SYSYS', 'SYSYS', 'SHW', 'SHW', 'JV', 'JV'], 'column3': ['V', 'CV', 'V', 'CV', 'V', 'CV', 'V', 'CV', 'V', 'CV', 'V', 'CV', 'V', 'CV', 'V', 'CV', 'V', 'CV'], 'column4': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'], 'column5': [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan], 'column6': ['BBB', 'BBB', 'CCC', 'CCC', 'BBB', 'BBB', 'BBB', 'BBB', 'VVV', 'VVV', 'CHCH', 'CHCH', 'CHCH', 'CHCH', 'CCC', 'CCC', 'CHCH', 'CHCH'], 'column7': ['Apr-21', 'Apr-21', 'Apr-21', 'Apr-21', 'Apr-21', 'Apr-21', 'Apr-21', 'Apr-21', 'Mar-21', 'Mar-21', 'Mar-21', 'Mar-21', 'Mar-21', 'Mar-21', 'Apr-21', 'Apr-21', 'Mar-21', 'Mar-21'], 'Feb-21': [11655, 0, 0, 0, 121117, 0, 14948, 0, 0, 0, 0, 0, 0, 0, 1838, 0, 0, 0], 'Mar-21': [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, -16474.0, -16474.0, 7000.0, 7000.0, -19946.0, -19946.0, 16084.44444444444, 0.0, 0.0, 0.0], 'Apr-21': [104815.0, 104815.0, 17949.0, 17949.0, 96132.0, 96132.0, 0.0, 0.0, -17001.0, -33475.0, -878.0, 6122.0, 8398.0, -11548.0, -5297.073170731703, -5297.073170731703, -282.0, -282.0], 'May-21': [78260.0, 183075.0, 42557.0, 60506.0, -15265.0, 80867.0, -18.0, -18.0, 21084.0, -12391.0, -1831.0, 4291.0, 2862.0, -8686.0, 5261.25, -35.8231707317027, -369.0, -651.0], 'Jun-21': [-52480.0, 130595.0, -13258.0, 47248.0, -35577.0, 45290.0, 2434.0, 2416.0, 31147.0, 18756.0, -4310.0, -19.0, -4750.0, -13436.0, -92.0, -127.8231707317027, -280.0, -931.0], 'Jul-21': [-174544.0, -43949.0, -38127.0, 9121.0, -124986.0, -79696.0, -9707.0, -7291.0, 13577.0, 32333.0, 0.0, -19.0, -15746.0, -29182.0, 93.0, -34.8231707317027, -319.0, -1250.0], 'Aug-21': [35498.0, -8451.0, -37094.0, -27973.0, 79021.0, -675.0, -1423.0, -8714.0, 
32168.0, 64501.0, 0.0, -19.0, 18702.0, -10480.0, 4347.634146341465, 4312.810975609762, -341.0, -1591.0], 'Sep-21': [44195.0, 35744.0, 2039.0, -25934.0, 70959.0, 70284.0, 2816.0, -5898.0, 38359.0, 102860.0, 0.0, -19.0, 18119.0, 7639.0, 5302.222222222219, 9615.033197831981, 0.0, -1591.0], 'Oct-21': [-13163.0, 22581.0, -4773.0, -30707.0, 205080.0, 275364.0, -709.0, -6607.0, -1397.0, 101463.0, 0.0, -19.0, 0.0, 7639.0, -34.0, 9581.033197831981, 0.0, -1591.0]}

df2 (expected output):

{'column1': ['CT', 'CT', 'NBB', 'NBB', 'CT', 'CT', 'NBB', 'NBB', 'HHH', 'HHH', 'TP1', 'TP1', 'TPR', 'TPR', 'PP1', 'PP1', 'PP1', 'PP1'], 'column2': ['POUPOU', 'POUPOU', 'PRPRP', 'PRPRP', 'STDD', 'STDD', 'STDD', 'STDD', 'STEVT', 'STEVT', 'SYSYS', 'SYSYS', 'SYSYS', 'SYSYS', 'SHW', 'SHW', 'JV', 'JV'], 'column3': ['V', 'CV', 'V', 'CV', 'V', 'CV', 'V', 'CV', 'V', 'CV', 'V', 'CV', 'V', 'CV', 'V', 'CV', 'V', 'CV'], 'column4': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'], 'column5': [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan], 'column6': ['BBB', 'BBB', 'CCC', 'CCC', 'BBB', 'BBB', 'BBB', 'BBB', 'VVV', 'VVV', 'CHCH', 'CHCH', 'CHCH', 'CHCH', 'CCC', 'CCC', 'CHCH', 'CHCH'], 'column7': ['Apr-21', 'Apr-21', 'Apr-21', 'Apr-21', 'Apr-21', 'Apr-21', 'Apr-21', 'Apr-21', 'Mar-21', 'Mar-21', 'Mar-21', 'Mar-21', 'Mar-21', 'Mar-21', 'Apr-21', 'Apr-21', 'Mar-21', 'Mar-21'], 'Feb-21': [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan], 'Mar-21': [nan, nan, nan, nan, nan, nan, nan, nan, nan, -16474.0, nan, nan, nan, -19946.0, nan, nan, nan, nan], 'Apr-21': [nan, nan, nan, nan, nan, nan, nan, nan, -17001.0, nan, nan, nan, nan, -11548.0, nan, -5297.073170731703, nan, -282.0], 'May-21': [nan, nan, nan, nan, nan, nan, nan, -18.0, nan, -12391.0, nan, nan, nan, -8686.0, nan, -35.8231707317027, -369.0, nan], 'Jun-21': [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, -19.0, -4750.0, nan, -92.0, nan, -280.0, nan], 'Jul-21': [nan, -43949.0, nan, nan, nan, -79696.0, nan, -7291.0, nan, nan, nan, -19.0, -15746.0, nan, nan, -34.8231707317027, -319.0, nan], 'Aug-21': [nan, -8451.0, nan, -27973.0, nan, -675.0, -1423.0, nan, nan, nan, nan, -19.0, nan, -10480.0, nan, nan, -341.0, nan], 'Sep-21': [nan, nan, nan, -25934.0, nan, nan, nan, -5898.0, nan, nan, nan, -19.0, nan, nan, nan, nan, nan, -1591.0], 'Oct-21': [nan, nan, -4773.0, nan, nan, nan, -709.0, nan, nan, nan, nan, 
-19.0, nan, nan, nan, nan, nan, -1591.0]}

Note: I have implemented my code on the basis of the Test data provided. The sample data is merely there to highlight the columns that are supposed to be manipulated.

  • Seems like the space is less for posting df2. I shall post it in the following comments. – royalewithcheese Mar 01 '21 at 12:38
  • 2
    You should post the code that allowed you to reach step 3. – Serge Ballesta Mar 01 '21 at 12:42
  • @SergeBallesta, I have updated the question with the code. I tried implementing till Step4.2 for the Test data but couldn't get 100% correct expected output. I hope the code helps and an elaborate explanation helps. Please let me know should you have any questions. – royalewithcheese Mar 02 '21 at 09:28
  • There is a typo in the df1 for july-21.. the first value in images is 30 while in dataset you provide its -30. – Akshay Sehgal Mar 03 '21 at 14:14
  • This has already been answered before many times. Treating the blanks in your examples above as `None` or `numpy.nan` in pandas, this question is a duplicate of https://stackoverflow.com/questions/31511997. Suggest you take down your bounty so it can be marked as such. – François Leblanc Mar 03 '21 at 16:53
  • @FrançoisLeblanc - the comparison is from right to left, not top to bottom. If you think this is doable from your side, you're most welcome to take the bounty :) – royalewithcheese Mar 04 '21 at 02:31
  • Also, I don't have the liberty to manipulate the data using the column headers, because the month-year header columns can be anything from Jan-xx to Dec-xx. – royalewithcheese Mar 04 '21 at 03:22

1 Answer

What about this?

For each step, I group on column1, then set column4 as the index and work on the transposed matrix with your criteria. Note that I've extrapolated a bit from your criteria to match your expected results (I hope it is correct, but you will have to check that).

Note also that I have kept each step separate to make it easier to read. It would be more efficient, though, to do the grouping/indexing/transposing in one shot and run the algorithm from there.

import pandas as pd
import numpy as np
pd.set_option('display.max_columns', None)


df1 = pd.DataFrame(
        {'column1': ['ABC', 'ABC', 'CDF', 'CDF'], 'column4': ['A', 'B', 'A', 'B'], 'Feb-21': [0, 10, 0, 0], 'Mar-21': [0, 0, 70, 70], 'Apr-21': [-10, -10, -8, 60], 'May-21': [-30, -60, -10, 40], 'Jun-21': [-20, 9, -40, -20], 'Jul-21': [30, -10, 0, -20], 'Aug-21': [-30, -20, 0, -20], 'Sep-21': [0, -15, 0, -20], 'Oct-21': [0, -15, 0, -20]}
        )

print(df1)
print('-'*50)

df = df1.copy()

#step1 :
df.iloc[:, 2:] = df.iloc[:, 2:].where(df.iloc[:, 2:] < 0, np.nan)


print(df)
print('-'*50)

#step2 :
def step2(df):
    df = df.set_index("column4").T
    ix = df[(df.A<=0) & (df.B.isnull())].index
    df.loc[ix, "A"] = np.nan
    return df.T
df = df.groupby('column1').apply(step2)
df.reset_index(drop=False, inplace=True)
print(df)
print('-'*50)


#step3 :
def step3(df):
    df = df.set_index("column4").T
    ix = df[(df.A.isnull()) & (df.B>=0)].index
    df.loc[ix, "B"] = df.loc[ix, "A"]
    return df.T
df = df.groupby('column1').apply(step3)
df.reset_index(drop=False, inplace=True)
print(df)
print('-'*50)

#step4 :
def step4(df):
    df = df.set_index("column4").T
    a_pos = df.columns.get_loc('A')
    b_pos = df.columns.get_loc('B')
    #step 4.1
    ix = df[(df.A<0) & (df.B<0)].index
    if len(ix):
        ix = df.index.get_indexer(ix)
        left_pos = ix-1
        condition_left = df.iloc[left_pos].notnull().any(axis=1)
        ix = condition_left[condition_left].index
        ix = df.index.get_indexer(ix)
        df.iloc[ix+1, b_pos] = np.nan
    
    #step 4.2
    ix = df[(df.A<0) & (df.B<0)].index
    if len(ix):
        ix = df.index.get_indexer(ix)
        left_pos = ix-1
        condition_left = df.iloc[left_pos].isnull().all(axis=1)    
        ix = condition_left[condition_left].index
        ix = df.index.get_indexer(ix)
        df.iloc[ix+1, a_pos] = np.nan
    
    #step 4.1 (again)
    ix = df[(df.A<0) & (df.B<0)].index
    if len(ix):
        ix = df.index.get_indexer(ix)
        left_pos = ix-1
        condition_left = df.iloc[left_pos].notnull().any(axis=1)
        ix = condition_left[condition_left].index
        ix = df.index.get_indexer(ix)
        df.iloc[ix+1, b_pos] = np.nan
        
    return df.T

df = df.groupby('column1').apply(step4)
df.reset_index(drop=False, inplace=True)
print(df)
print('-'*50)

EDIT

I'll assume here (based on your previous comment) that your dataframe will always be composed of alternating A/B rows (and that their order in the dataframe is valid). We then need to compute an artificial index to identify each pair of rows.

Note that I used a fillna on your columns (mainly column5), as that is good practice before groupby commands. Given the multiple levels of columns, I'm not sure it has an impact anyway...

Boolean indexing gets tricky when managing multiple levels of columns. You will see that I convert each "column" to a numpy array (using the .values attribute). Somehow, pandas won't perform the boolean match across multiple levels; I'm not exactly sure why.

So this goes :

df = pd.DataFrame(  ...  )

#Compute the unique index for each pair of rows
df.reset_index(drop=False, inplace=True)
ix = df.index
ix = ix[ix%2==0]
df.loc[ix, 'index'] = df.shift(-1).loc[ix, 'index']

#step1 :
cols = [x for x in df.columns.tolist() if not x.startswith('column') and x != "index"]
df[cols] = df[cols].where(df[cols] < 0, np.nan)


cols_index = ["column4", "column1", "column2", "column3", "column5", "column6", "column7"]
df[cols_index] = df[cols_index].fillna(-1)

#step2 :
def step2(df):
    df = df.set_index(cols_index).drop('index', axis=1).T
    ix = df[
            (df.A<=0).values
            & df.B.isnull().values
         ].index
    df.loc[ix, "A"] = np.nan
    return df.T
df = df.groupby('index').apply(step2)
print(df)
df.reset_index(drop=False, inplace=True)
print(df)
print('-'*50)


#step3 :
def step3(df):
    df = df.set_index(cols_index).drop('index', axis=1).T
    ix = df[
            df.A.isnull().values 
            & (df.B>=0).values
            ].index
    df.loc[ix, "B"] = df.loc[ix, "A"]
    return df.T
df = df.groupby('index').apply(step3)
df.reset_index(drop=False, inplace=True)
print(df)
print('-'*50)

#step4 :
def step4(df):
    df = df.set_index(cols_index).drop('index', axis=1).T
    a_pos = df.columns.get_loc('A')
    b_pos = df.columns.get_loc('B')
    
    #step 4.1
    ix = df[
            (df.A<0).values
            & (df.B<0).values
            ].index
    if len(ix):
        ix = df.index.get_indexer(ix)
        left_pos = ix-1
        condition_left = df.iloc[left_pos].notnull().any(axis=1)
        ix = condition_left[condition_left].index
        ix = df.index.get_indexer(ix)
        df.iloc[ix+1, b_pos] = np.nan
    
    #step 4.2
    ix = df[
            (df.A<0).values
            & (df.B<0).values
            ].index
    if len(ix):
        ix = df.index.get_indexer(ix)
        left_pos = ix-1
        condition_left = df.iloc[left_pos].isnull().all(axis=1)    
        ix = condition_left[condition_left].index
        ix = df.index.get_indexer(ix)
        df.iloc[ix+1, a_pos] = np.nan
    
    #step 4.1 (again)
    ix = df[
            (df.A<0).values
            & (df.B<0).values
            ].index
    if len(ix):
        ix = df.index.get_indexer(ix)
        left_pos = ix-1
        condition_left = df.iloc[left_pos].notnull().any(axis=1)
        ix = condition_left[condition_left].index
        ix = df.index.get_indexer(ix)
        df.iloc[ix+1, b_pos] = np.nan
        
    return df.T

df = df.groupby('index').apply(step4)
df.reset_index(drop=False, inplace=True)
print(df)
print('-'*50)

And if you want to restore your column5 :

df[cols_index] = df[cols_index].replace(-1, np.nan)
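As a side note, the pair-index trick from the edit (giving each A row the row number of the B row that follows it, under the alternating-rows assumption) can be illustrated in isolation:

```python
import pandas as pd

# hypothetical frame whose rows strictly alternate A/B
df = pd.DataFrame({'column4': ['A', 'B', 'A', 'B']})
df.reset_index(drop=False, inplace=True)   # materialise the row number as 'index'
ix = df.index[df.index % 2 == 0]           # positions of the A rows
# give each A row the 'index' of the B row just below it
df.loc[ix, 'index'] = df.shift(-1).loc[ix, 'index']
```

Each A/B pair now shares one 'index' value, so `groupby('index')` sees exactly one pair per group, which is what lets the per-pair transpose work even when column1 is duplicated.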
  • Thanks, did you try running your algorithm using the test data? If so, may I know what would I need to change in your current code? – royalewithcheese Mar 07 '21 at 08:59
  • So, I did try to run your code with the Test data and got the following error in step2, `ValueError: cannot reindex from a duplicate axis` – royalewithcheese Mar 07 '21 at 09:27
  • Would you happen to have any insights on how to deal with this problem? or if anything is unclear please let me know because I feel this is really close to the solution. – royalewithcheese Mar 07 '21 at 13:10
  • Sorry, I didn't see that your other sample had a different structure. This exception is raised because of this (in fact, it's because column1 can be duplicated more than once). To help you, I will need to know more about the indexing of your dataframe: can you identify a unique pair using a combination of column1 to column7? Or can we assume that your dataset will always be constructed with alternating rows? – tgrandje Mar 07 '21 at 13:31
  • No problem. First question: I don't think we have any combinations that we may use them uniquely. Second question: Not sure what you mean exactly. The structure of the data, that I am working with, is of the same structure, that I can say for sure. – royalewithcheese Mar 07 '21 at 13:37
  • *The structure of the Test data.. (For some reason I can't edit my comment) – royalewithcheese Mar 07 '21 at 13:48
  • 1
    Have edited my answer. Hope this will match your real dataset (my code should make my previous comment more concrete) – tgrandje Mar 07 '21 at 14:24
  • Now I understand what you mean. The assumption is correct, they always alternate. Thanks so much for your efforts. Since the bounty is expiring soon and I really want to give you the bounty rep, I am giving it to you now; however, I might (hopefully not) need your help on this question should there be any discrepancy in the near future. – royalewithcheese Mar 09 '21 at 05:24
  • 1
    Anytime! Though if that's because my explanations are not clear enough, feel free to share the problematic parts with me ;-) – tgrandje Mar 09 '21 at 08:08