
Suppose I have a df like the following:

col1   | type   | date_1 | date_2 | date_3 |.... | date_n
ab     |   A    |  -10   |        |  -10
ab     |   B    |  100   |   99   |  -12
cd     |   A    |   0    |  -25   |   6
cd     |   B    |  -1    |   8    |  -34
ab     |   A    |   98   |  -9    |   0
ab     |   B    |  -7    |  -2    |   0

The first step is to remove all positive numbers, including 0.

Now the df should look like this:

col1   | type   | date_1 | date_2 | date_3 | .... | date_n
ab     |   A    |  -10   |        |  -10   |
ab     |   B    |        |        |  -12   |
cd     |   A    |        |  -25   |        |
cd     |   B    |  -1    |        |  -34   |
ab     |   A    |        |  -9    |        |
ab     |   B    |  -7    |  -2    |        |
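Step 1 can be sketched like this, using a toy frame rebuilt from the tables above. Column selection is positional (everything after the first two columns), since the note below says the date headers cannot be relied on:

```python
import numpy as np
import pandas as pd

# toy frame reconstructed from the question's first table
df = pd.DataFrame({
    "col1": ["ab", "ab", "cd", "cd", "ab", "ab"],
    "type": ["A", "B", "A", "B", "A", "B"],
    "date_1": [-10, 100, 0, -1, 98, -7],
    "date_2": [np.nan, 99, -25, 8, -9, -2],
    "date_3": [-10, -12, 6, -34, 0, 0],
})

date_cols = df.columns[2:]                      # positions, not header names
df[date_cols] = df[date_cols].mask(df[date_cols] >= 0)  # blank out >= 0
print(df)
```

`mask` leaves existing blanks (NaN) untouched, because `NaN >= 0` evaluates to False.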

The second step is to compare the numbers in each 'date' column between 'type' A and 'type' B:

  • If the 'type' A row has a negative number and the 'type' B row is blank, then remove the negative number of 'type' A in that 'date' column

  • If the 'type' B row has a negative number and 'type' A is blank, then do nothing

  • If both types are blank, do nothing

After this step, the df should look like this,

col1   | type   | date_1 | date_2 | date_3 | .... | date_n
ab     |   A    |        |        |  -10   |
ab     |   B    |        |        |  -12   |
cd     |   A    |        |        |        |
cd     |   B    |  -1    |        |  -34   |
ab     |   A    |        |  -9    |        |
ab     |   B    |  -7    |  -2    |        |
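Step 2 can be sketched by splitting the rows into the A rows and their paired B rows (assuming, as in the tables, that rows always come in A/B pairs). It starts from the step-1 result above:

```python
import numpy as np
import pandas as pd

# step-1 result from the question's second table
df = pd.DataFrame({
    "col1": ["ab", "ab", "cd", "cd", "ab", "ab"],
    "type": ["A", "B", "A", "B", "A", "B"],
    "date_1": [-10.0, np.nan, np.nan, -1.0, np.nan, -7.0],
    "date_2": [np.nan, np.nan, -25.0, np.nan, -9.0, -2.0],
    "date_3": [-10.0, -12.0, np.nan, -34.0, np.nan, np.nan],
})

vals = df.iloc[:, 2:]
a = vals.iloc[0::2].reset_index(drop=True)   # 'type' A rows
b = vals.iloc[1::2].reset_index(drop=True)   # paired 'type' B rows

# blank A where A holds a negative and the paired B cell is blank
a = a.mask(a.notna() & b.isna())
df.iloc[0::2, 2:] = a.values
print(df)
```

The other two bullets need no code: a negative B with blank A, and two blanks, are both left as-is.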

Final step,

  • If both types are negative in the current 'date' column, then for each set of col1 (ab, cd, ab), check the 'type' A and 'type' B values in the column immediately to the left of the current one:

    1) If both left-hand values ('type' A and 'type' B) are blank, then remove the negative number of the current column's 'type' A and keep the -ve number of 'type' B

    2) If either of the left-hand values is blank, then remove the negative number of the current column's 'type' B and keep the -ve number of 'type' A
    

Finally, the final_df should look like this,

col1   | type   | date_1 | date_2 | date_3 | .... | date_n
ab     |   A    |        |        |        |
ab     |   B    |        |        |  -12   |
cd     |   A    |        |        |        |
cd     |   B    |  -1    |        |  -34   |
ab     |   A    |        |  -9    |        |
ab     |   B    |  -7    |        |        |

For the final step, the comparison should start from 'date_2', since 'date_1' has no column to its left.
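The final step can be sketched as a plain loop over column positions, following the rules above plus the clarification in the comments (when the left-hand values are not both blank, the 'type' B value is the one removed). It starts from the step-2 result:

```python
import numpy as np
import pandas as pd

# step-2 result from the question's third table
df = pd.DataFrame({
    "col1": ["ab", "ab", "cd", "cd", "ab", "ab"],
    "type": ["A", "B", "A", "B", "A", "B"],
    "date_1": [np.nan, np.nan, np.nan, -1.0, np.nan, -7.0],
    "date_2": [np.nan, np.nan, np.nan, np.nan, -9.0, -2.0],
    "date_3": [-10.0, -12.0, np.nan, -34.0, np.nan, np.nan],
})

n_dates = df.shape[1] - 2
for j in range(1, n_dates):             # start from date_2 (second date column)
    cur, prev = 2 + j, 2 + j - 1        # positional: headers never referenced
    for i in range(0, len(df), 2):      # each A/B row pair
        a_cur, b_cur = df.iat[i, cur], df.iat[i + 1, cur]
        if pd.notna(a_cur) and pd.notna(b_cur):          # both negative
            if pd.isna(df.iat[i, prev]) and pd.isna(df.iat[i + 1, prev]):
                df.iat[i, cur] = np.nan      # both left blank -> drop A
            else:
                df.iat[i + 1, cur] = np.nan  # otherwise -> drop B
print(df)
```

This reproduces the expected final_df table above; the vectorised answers below do the same without the explicit loops.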

What would be the best way to solve this problem? Any help would be greatly appreciated!

Note: I cannot use the column headers (the date ones) to manipulate data because they will keep changing.

Test Data:

{'column1': ['CT', 'CT', 'NBB', 'NBB', 'CT', 'CT', 'NBB', 'NBB', 'HHH', 'HHH', 'TP1', 'TP1', 'TPR', 'TPR', 'PP1', 'PP1', 'PP1', 'PP1'], 'column2': ['POUPOU', 'POUPOU', 'PRPRP', 'PRPRP', 'STDD', 'STDD', 'STDD', 'STDD', 'STEVT', 'STEVT', 'SYSYS', 'SYSYS', 'SYSYS', 'SYSYS', 'SHW', 'SHW', 'JV', 'JV'], 'column3': ['V', 'CV', 'V', 'CV', 'V', 'CV', 'V', 'CV', 'V', 'CV', 'V', 'CV', 'V', 'CV', 'V', 'CV', 'V', 'CV'], 'column4': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'], 'column5': [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan], 'column6': ['BBB', 'BBB', 'CCC', 'CCC', 'BBB', 'BBB', 'BBB', 'BBB', 'VVV', 'VVV', 'CHCH', 'CHCH', 'CHCH', 'CHCH', 'CCC', 'CCC', 'CHCH', 'CHCH'], 'column7': ['Apr-21', 'Apr-21', 'Apr-21', 'Apr-21', 'Apr-21', 'Apr-21', 'Apr-21', 'Apr-21', 'Mar-21', 'Mar-21', 'Mar-21', 'Mar-21', 'Mar-21', 'Mar-21', 'Apr-21', 'Apr-21', 'Mar-21', 'Mar-21'], 'Feb-21': [11655, 0, 0, 0, 121117, 0, 14948, 0, 0, 0, 0, 0, 0, 0, 1838, 0, 0, 0], 'Mar-21': [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, -16474.0, -16474.0, 7000.0, 7000.0, -19946.0, -19946.0, 16084.44444444444, 0.0, 0.0, 0.0], 'Apr-21': [104815.0, 104815.0, 17949.0, 17949.0, 96132.0, 96132.0, 0.0, 0.0, -17001.0, -33475.0, -878.0, 6122.0, 8398.0, -11548.0, -5297.073170731703, -5297.073170731703, -282.0, -282.0], 'May-21': [78260.0, 183075.0, 42557.0, 60506.0, -15265.0, 80867.0, -18.0, -18.0, 21084.0, -12391.0, -1831.0, 4291.0, 2862.0, -8686.0, 5261.25, -35.8231707317027, -369.0, -651.0], 'Jun-21': [-52480.0, 130595.0, -13258.0, 47248.0, -35577.0, 45290.0, 2434.0, 2416.0, 31147.0, 18756.0, -4310.0, -19.0, -4750.0, -13436.0, -92.0, -127.8231707317027, -280.0, -931.0], 'Jul-21': [-174544.0, -43949.0, -38127.0, 9121.0, -124986.0, -79696.0, -9707.0, -7291.0, 13577.0, 32333.0, 0.0, -19.0, -15746.0, -29182.0, 93.0, -34.8231707317027, -319.0, -1250.0], 'Aug-21': [35498.0, -8451.0, -37094.0, -27973.0, 79021.0, -675.0, -1423.0, -8714.0, 
32168.0, 64501.0, 0.0, -19.0, 18702.0, -10480.0, 4347.634146341465, 4312.810975609762, -341.0, -1591.0], 'Sep-21': [44195.0, 35744.0, 2039.0, -25934.0, 70959.0, 70284.0, 2816.0, -5898.0, 38359.0, 102860.0, 0.0, -19.0, 18119.0, 7639.0, 5302.222222222219, 9615.033197831981, 0.0, -1591.0], 'Oct-21': [-13163.0, 22581.0, -4773.0, -30707.0, 205080.0, 275364.0, -709.0, -6607.0, -1397.0, 101463.0, 0.0, -19.0, 0.0, 7639.0, -34.0, 9581.033197831981, 0.0, -1591.0]}

Expected output:

{'column1': ['CT', 'CT', 'NBB', 'NBB', 'CT', 'CT', 'NBB', 'NBB', 'HHH', 'HHH', 'TP1', 'TP1', 'TPR', 'TPR', 'PP1', 'PP1', 'PP1', 'PP1'], 'column2': ['POUPOU', 'POUPOU', 'PRPRP', 'PRPRP', 'STDD', 'STDD', 'STDD', 'STDD', 'STEVT', 'STEVT', 'SYSYS', 'SYSYS', 'SYSYS', 'SYSYS', 'SHW', 'SHW', 'JV', 'JV'], 'column3': ['V', 'CV', 'V', 'CV', 'V', 'CV', 'V', 'CV', 'V', 'CV', 'V', 'CV', 'V', 'CV', 'V', 'CV', 'V', 'CV'], 'column4': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'], 'column5': [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan], 'column6': ['BBB', 'BBB', 'CCC', 'CCC', 'BBB', 'BBB', 'BBB', 'BBB', 'VVV', 'VVV', 'CHCH', 'CHCH', 'CHCH', 'CHCH', 'CCC', 'CCC', 'CHCH', 'CHCH'], 'column7': ['Apr-21', 'Apr-21', 'Apr-21', 'Apr-21', 'Apr-21', 'Apr-21', 'Apr-21', 'Apr-21', 'Mar-21', 'Mar-21', 'Mar-21', 'Mar-21', 'Mar-21', 'Mar-21', 'Apr-21', 'Apr-21', 'Mar-21', 'Mar-21'], 'Feb-21': [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan], 'Mar-21': [nan, nan, nan, nan, nan, nan, nan, nan, nan, -16474.0, nan, nan, nan, -19946.0, nan, nan, nan, nan], 'Apr-21': [nan, nan, nan, nan, nan, nan, nan, nan, -17001.0, nan, nan, nan, nan, -11548.0, nan, -5297.073170731703, nan, -282.0], 'May-21': [nan, nan, nan, nan, nan, nan, nan, -18.0, nan, -12391.0, nan, nan, nan, -8686.0, nan, -35.8231707317027, -369.0, nan], 'Jun-21': [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, -19.0, -4750.0, nan, -92.0, nan, -280.0, nan], 'Jul-21': [nan, -43949.0, nan, nan, nan, -79696.0, nan, -7291.0, nan, nan, nan, -19.0, -15746.0, nan, nan, -34.8231707317027, -319.0, nan], 'Aug-21': [nan, -8451.0, nan, -27973.0, nan, -675.0, -1423.0, nan, nan, nan, nan, -19.0, nan, -10480.0, nan, nan, -341.0, nan], 'Sep-21': [nan, nan, nan, -25934.0, nan, nan, nan, -5898.0, nan, nan, nan, -19.0, nan, nan, nan, nan, nan, -1591.0], 'Oct-21': [nan, nan, -4773.0, nan, nan, nan, -709.0, nan, nan, nan, nan, 
-19.0, nan, nan, nan, nan, nan, -1591.0]}
royalewithcheese
    Have you tried anything so far? – Muhammad Junaid Haris Feb 04 '21 at 08:56
  • Is it possible to explain the final step more? Why is `-2` removed for `ef` and `-10` for `ab`? – jezrael Feb 04 '21 at 09:28
  • What does `check the left-hand-side of the same row` mean? – jezrael Feb 04 '21 at 09:29
  • for final step, we start with col date_2 and compare the values with that of date_1. For ef we remove -2 because to the left, that is, values in col date_1 are negative and according to the condition mentioned, the value of the corresponding type B is removed and type A is kept. Hope this helps! – royalewithcheese Feb 04 '21 at 09:34
  • @MuhammadJunaidHaris I am actually struggling with the Final Step. I thought it would be better to explain the entire problem to see if someone has a better/efficient way to approach this problem :) – royalewithcheese Feb 04 '21 at 09:34
  • So it means `date_2, date_3`... are selected `A` or `B` by check `date_1` ? So if exist values there (in date_1) like `ef` then is removed `B` else removed `A` ? – jezrael Feb 04 '21 at 09:40
  • no..date_2 will be compared with date_1, date_3 will be compared with date_2 and so on. Please let me know if anything else is unclear. – royalewithcheese Feb 04 '21 at 09:43
  • Ok, so -10 removed, because no values in `date_2` for `ab`'? – jezrael Feb 04 '21 at 09:52
  • that is correct according to second point under final step.. – royalewithcheese Feb 04 '21 at 09:55
  • can you provide more details about your last steps? I have created a rough solution up till 2nd step but further details are little confusing. – k33da_the_bug Feb 04 '21 at 10:28
  • for the final step, if we take date_2 col, you will see that the values for ef's A and B (date_1) are both negative, and according to the mentioned condition, type B will be removed, that is, -2. Now, if we consider date_3 col, since the cells of date_2 for ab is blank, then -10 will be removed according to the mentioned condition. – royalewithcheese Feb 04 '21 at 11:23

2 Answers

import pandas as pd

# `d` is the test-data dict from the question, with the non-date columns
# renamed: column1 -> col1, column4 -> type, and the remaining descriptive
# columns to col2, subtype, col3, col4, col5
df = pd.DataFrame(d)

rem_cols = ['col2', 'subtype', 'col3', 'col4', 'col5']
df['g'] = df.groupby(['col1', 'type']).cumcount()

df1 = df.drop(rem_cols, axis=1)

df1 = df1.set_index(['col1','g', 'type'])


df1.columns = pd.to_datetime(df1.columns)

first_date = df1.columns[0]
df1 = df1.unstack(-1)

# print (df1.stack(dropna=False).reset_index())

df1 = df1.mask(df1.ge(0))

m1 = (df1.xs('A', level=1, axis=1, drop_level=False).notna() & 
      df1.xs('B', level=1, axis=1, drop_level=False).rename(columns={'B':'A'}, level=1).isna())
m2 = (df1.xs('B', level=1, axis=1, drop_level=False).notna() &
      df1.xs('A', level=1, axis=1, drop_level=False).rename(columns={'A':'B'}, level=1).isna())

m = m1.join(m2)

df1 = df1.mask(m)
# print (df1)

df2 = df1.groupby(level=1, axis=1).shift(1, axis=1)
# print (df2)

mask1 = df1.notna() & df2.isna() & (df1.columns.get_level_values(1) == 'A')[ None, :]
#avoid change values for date_1
mask1[first_date] = False

mask2 = df1.notna() & df2.notna() & (df1.columns.get_level_values(1) == 'B')[ None, :]

df1 = df1.mask(mask1).mask(mask2).stack(dropna=False)
# print (df1)

df = df[rem_cols + ['col1','g', 'type']].join(df1, on=['col1','g', 'type'])
print (df)    
jezrael

An alternative using numpy and a truth table. The logic for step 3 is a bit write-once, read-never. I added another test case, gh.

import io
import pandas as pd
import numpy as np
df = pd.read_csv(io.StringIO("""col1   | type   | date_1 | date_2 | date_3 
ab     |   A    |  -10   |    --    |  -10
ab     |   B    |  100   |   99   |  -12
cd     |   A    |   0    |  -25   |   6
cd     |   B    |  -1    |   8    |  -34
ef     |   A    |  -98   |  -9    |   0
ef     |   B    |  -7    |  -2    |   0
gh     |   A    |  -22   |  0    |   -3
gh     |   B    |  -75   |  0    |   -1

"""), sep=r"\s+\|\s+", engine="python").replace("--", "", regex=True)

# reshape dataframe so it's in better structure for steps
dfs = df.set_index(["col1","type"]).unstack(1)

# step 1,  identify all -ve values
a = dfs.replace("", 0).astype(int).lt(0).values
# step 2, keep where negative for type A & B,  note 2 relates to len(["A","B"])
ap = a.reshape((a.shape[0]*a.shape[1])//2, 2).all(axis=1)
a = np.repeat(ap,2).reshape(a.shape)

# step 3
# B are odd columns...
bcol = [bcol for bcol in range(a.shape[1]) if bcol%2==1]
# if A&B are both -ve for first date,  remove B value for other dates
# logic: a. A&B both -ve: a[:,[0,1]].all(axis=1),2)
#        b. logical and(not(A&B -ve), (B -ve))
a[:,bcol[1:]] = a[:,bcol[1:]]&np.repeat(~a[:,[0,1]].all(axis=1),2).reshape(len(dfs),2)

# rebuild df with truth array built for step 1&2&3
dfs.loc[:] = np.where(a, dfs, "")

# back to original shape...
df = dfs.stack().reset_index()

output

  col1 type date_1 date_2 date_3
0   ab    A                  -10
1   ab    B                  -12
2   cd    A                     
3   cd    B                     
4   ef    A    -98     -9       
5   ef    B     -7              
6   gh    A    -22            -3
7   gh    B    -75              

blank and negative A

  • replace step 1 with a more sophisticated binary logic
# col A, blanks & negative
a = (((dfs.columns.get_level_values(1)=="A") 
 & dfs.replace("",0).astype(int).le(0)) |
# col B, only negative
((dfs.columns.get_level_values(1)=="B") 
 & dfs.replace("",0).astype(int).lt(0))).values

Rob Raymond