
Suppose I have a df like the following:

col1   | type   | date_1 | date_2 | date_3 |.... | date_n
ab     |   A    |  -10   |        |  -10
ab     |   B    |  100   |   99   |  -12
cd     |   A    |   0    |  -25   |   6
cd     |   B    |  -1    |   8    |  -34
ab     |   A    |   98   |  -9    |   0
ab     |   B    |  -7    |  -2    |   0

The first step is to remove all positive numbers, including 0.

Now the df should look like this:

col1   | type   | date_1 | date_2 | date_3 | .... | date_n
ab     |   A    |  -10   |        |  -10   |
ab     |   B    |        |        |  -12   |
cd     |   A    |        |  -25   |        |
cd     |   B    |  -1    |        |  -34   |
ab     |   A    |        |  -9    |        |
ab     |   B    |  -7    |  -2    |        |
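Step 1 can be sketched like this, using a toy frame rebuilt from the tables above. Column selection is positional (everything after the first two columns), since the note below says the date headers cannot be relied on:

```python
import numpy as np
import pandas as pd

# toy frame reconstructed from the question's first table
df = pd.DataFrame({
    "col1": ["ab", "ab", "cd", "cd", "ab", "ab"],
    "type": ["A", "B", "A", "B", "A", "B"],
    "date_1": [-10, 100, 0, -1, 98, -7],
    "date_2": [np.nan, 99, -25, 8, -9, -2],
    "date_3": [-10, -12, 6, -34, 0, 0],
})

date_cols = df.columns[2:]                      # positions, not header names
df[date_cols] = df[date_cols].mask(df[date_cols] >= 0)  # blank out >= 0
print(df)
```

`mask` leaves existing blanks (NaN) untouched, because `NaN >= 0` evaluates to False.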

The second step is to compare the numbers in each 'date' column between 'type' A and 'type' B:

  • If the 'type' A row has a negative number and the 'type' B row is blank, then remove the negative number of 'type' A in that 'date' column

  • If the 'type' B row has a negative number and 'type' A is blank, then do nothing

  • If both types are blank, do nothing

After this step, the df should look like this,

col1   | type   | date_1 | date_2 | date_3 | .... | date_n
ab     |   A    |        |        |  -10   |
ab     |   B    |        |        |  -12   |
cd     |   A    |        |        |        |
cd     |   B    |  -1    |        |  -34   |
ab     |   A    |        |  -9    |        |
ab     |   B    |  -7    |  -2    |        |
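Step 2 can be sketched by splitting the rows into the A rows and their paired B rows (assuming, as in the tables, that rows always come in A/B pairs). It starts from the step-1 result above:

```python
import numpy as np
import pandas as pd

# step-1 result from the question's second table
df = pd.DataFrame({
    "col1": ["ab", "ab", "cd", "cd", "ab", "ab"],
    "type": ["A", "B", "A", "B", "A", "B"],
    "date_1": [-10.0, np.nan, np.nan, -1.0, np.nan, -7.0],
    "date_2": [np.nan, np.nan, -25.0, np.nan, -9.0, -2.0],
    "date_3": [-10.0, -12.0, np.nan, -34.0, np.nan, np.nan],
})

vals = df.iloc[:, 2:]
a = vals.iloc[0::2].reset_index(drop=True)   # 'type' A rows
b = vals.iloc[1::2].reset_index(drop=True)   # paired 'type' B rows

# blank A where A holds a negative and the paired B cell is blank
a = a.mask(a.notna() & b.isna())
df.iloc[0::2, 2:] = a.values
print(df)
```

The other two bullets need no code: a negative B with blank A, and two blanks, are both left as-is.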

Final step,

  • If both types are negative in the current 'date' column, then for each set of col1 (ab, cd, ab), check the 'type' A and 'type' B values in the column immediately to the left of the current one:

    1) If both left-hand values ('type' A and 'type' B) are blank, then remove the negative number of the current column's 'type' A and keep the -ve number of 'type' B

    2) If either of the left-hand values is blank, then remove the negative number of the current column's 'type' B and keep the -ve number of 'type' A
    

Finally, the final_df should look like this,

col1   | type   | date_1 | date_2 | date_3 | .... | date_n
ab     |   A    |        |        |        |
ab     |   B    |        |        |  -12   |
cd     |   A    |        |        |        |
cd     |   B    |  -1    |        |  -34   |
ab     |   A    |        |  -9    |        |
ab     |   B    |  -7    |        |        |

For the final step, the comparison should start from 'date_2', since 'date_1' has no column to its left.
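The final step can be sketched as a plain loop over column positions, following the rules above plus the clarification in the comments (when the left-hand values are not both blank, the 'type' B value is the one removed). It starts from the step-2 result:

```python
import numpy as np
import pandas as pd

# step-2 result from the question's third table
df = pd.DataFrame({
    "col1": ["ab", "ab", "cd", "cd", "ab", "ab"],
    "type": ["A", "B", "A", "B", "A", "B"],
    "date_1": [np.nan, np.nan, np.nan, -1.0, np.nan, -7.0],
    "date_2": [np.nan, np.nan, np.nan, np.nan, -9.0, -2.0],
    "date_3": [-10.0, -12.0, np.nan, -34.0, np.nan, np.nan],
})

n_dates = df.shape[1] - 2
for j in range(1, n_dates):             # start from date_2 (second date column)
    cur, prev = 2 + j, 2 + j - 1        # positional: headers never referenced
    for i in range(0, len(df), 2):      # each A/B row pair
        a_cur, b_cur = df.iat[i, cur], df.iat[i + 1, cur]
        if pd.notna(a_cur) and pd.notna(b_cur):          # both negative
            if pd.isna(df.iat[i, prev]) and pd.isna(df.iat[i + 1, prev]):
                df.iat[i, cur] = np.nan      # both left blank -> drop A
            else:
                df.iat[i + 1, cur] = np.nan  # otherwise -> drop B
print(df)
```

This reproduces the expected final_df table above; the vectorised answers below do the same without the explicit loops.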

What would be the best way to solve this problem? Any help would be greatly appreciated!

Note: I cannot use the column headers (the date ones) to manipulate data because they will keep changing.

Test Data:

{'column1': ['CT', 'CT', 'NBB', 'NBB', 'CT', 'CT', 'NBB', 'NBB', 'HHH', 'HHH', 'TP1', 'TP1', 'TPR', 'TPR', 'PP1', 'PP1', 'PP1', 'PP1'], 'column2': ['POUPOU', 'POUPOU', 'PRPRP', 'PRPRP', 'STDD', 'STDD', 'STDD', 'STDD', 'STEVT', 'STEVT', 'SYSYS', 'SYSYS', 'SYSYS', 'SYSYS', 'SHW', 'SHW', 'JV', 'JV'], 'column3': ['V', 'CV', 'V', 'CV', 'V', 'CV', 'V', 'CV', 'V', 'CV', 'V', 'CV', 'V', 'CV', 'V', 'CV', 'V', 'CV'], 'column4': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'], 'column5': [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan], 'column6': ['BBB', 'BBB', 'CCC', 'CCC', 'BBB', 'BBB', 'BBB', 'BBB', 'VVV', 'VVV', 'CHCH', 'CHCH', 'CHCH', 'CHCH', 'CCC', 'CCC', 'CHCH', 'CHCH'], 'column7': ['Apr-21', 'Apr-21', 'Apr-21', 'Apr-21', 'Apr-21', 'Apr-21', 'Apr-21', 'Apr-21', 'Mar-21', 'Mar-21', 'Mar-21', 'Mar-21', 'Mar-21', 'Mar-21', 'Apr-21', 'Apr-21', 'Mar-21', 'Mar-21'], 'Feb-21': [11655, 0, 0, 0, 121117, 0, 14948, 0, 0, 0, 0, 0, 0, 0, 1838, 0, 0, 0], 'Mar-21': [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, -16474.0, -16474.0, 7000.0, 7000.0, -19946.0, -19946.0, 16084.44444444444, 0.0, 0.0, 0.0], 'Apr-21': [104815.0, 104815.0, 17949.0, 17949.0, 96132.0, 96132.0, 0.0, 0.0, -17001.0, -33475.0, -878.0, 6122.0, 8398.0, -11548.0, -5297.073170731703, -5297.073170731703, -282.0, -282.0], 'May-21': [78260.0, 183075.0, 42557.0, 60506.0, -15265.0, 80867.0, -18.0, -18.0, 21084.0, -12391.0, -1831.0, 4291.0, 2862.0, -8686.0, 5261.25, -35.8231707317027, -369.0, -651.0], 'Jun-21': [-52480.0, 130595.0, -13258.0, 47248.0, -35577.0, 45290.0, 2434.0, 2416.0, 31147.0, 18756.0, -4310.0, -19.0, -4750.0, -13436.0, -92.0, -127.8231707317027, -280.0, -931.0], 'Jul-21': [-174544.0, -43949.0, -38127.0, 9121.0, -124986.0, -79696.0, -9707.0, -7291.0, 13577.0, 32333.0, 0.0, -19.0, -15746.0, -29182.0, 93.0, -34.8231707317027, -319.0, -1250.0], 'Aug-21': [35498.0, -8451.0, -37094.0, -27973.0, 79021.0, -675.0, -1423.0, -8714.0, 
32168.0, 64501.0, 0.0, -19.0, 18702.0, -10480.0, 4347.634146341465, 4312.810975609762, -341.0, -1591.0], 'Sep-21': [44195.0, 35744.0, 2039.0, -25934.0, 70959.0, 70284.0, 2816.0, -5898.0, 38359.0, 102860.0, 0.0, -19.0, 18119.0, 7639.0, 5302.222222222219, 9615.033197831981, 0.0, -1591.0], 'Oct-21': [-13163.0, 22581.0, -4773.0, -30707.0, 205080.0, 275364.0, -709.0, -6607.0, -1397.0, 101463.0, 0.0, -19.0, 0.0, 7639.0, -34.0, 9581.033197831981, 0.0, -1591.0]}

Expected output:

{'column1': ['CT', 'CT', 'NBB', 'NBB', 'CT', 'CT', 'NBB', 'NBB', 'HHH', 'HHH', 'TP1', 'TP1', 'TPR', 'TPR', 'PP1', 'PP1', 'PP1', 'PP1'], 'column2': ['POUPOU', 'POUPOU', 'PRPRP', 'PRPRP', 'STDD', 'STDD', 'STDD', 'STDD', 'STEVT', 'STEVT', 'SYSYS', 'SYSYS', 'SYSYS', 'SYSYS', 'SHW', 'SHW', 'JV', 'JV'], 'column3': ['V', 'CV', 'V', 'CV', 'V', 'CV', 'V', 'CV', 'V', 'CV', 'V', 'CV', 'V', 'CV', 'V', 'CV', 'V', 'CV'], 'column4': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'], 'column5': [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan], 'column6': ['BBB', 'BBB', 'CCC', 'CCC', 'BBB', 'BBB', 'BBB', 'BBB', 'VVV', 'VVV', 'CHCH', 'CHCH', 'CHCH', 'CHCH', 'CCC', 'CCC', 'CHCH', 'CHCH'], 'column7': ['Apr-21', 'Apr-21', 'Apr-21', 'Apr-21', 'Apr-21', 'Apr-21', 'Apr-21', 'Apr-21', 'Mar-21', 'Mar-21', 'Mar-21', 'Mar-21', 'Mar-21', 'Mar-21', 'Apr-21', 'Apr-21', 'Mar-21', 'Mar-21'], 'Feb-21': [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan], 'Mar-21': [nan, nan, nan, nan, nan, nan, nan, nan, nan, -16474.0, nan, nan, nan, -19946.0, nan, nan, nan, nan], 'Apr-21': [nan, nan, nan, nan, nan, nan, nan, nan, -17001.0, nan, nan, nan, nan, -11548.0, nan, -5297.073170731703, nan, -282.0], 'May-21': [nan, nan, nan, nan, nan, nan, nan, -18.0, nan, -12391.0, nan, nan, nan, -8686.0, nan, -35.8231707317027, -369.0, nan], 'Jun-21': [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, -19.0, -4750.0, nan, -92.0, nan, -280.0, nan], 'Jul-21': [nan, -43949.0, nan, nan, nan, -79696.0, nan, -7291.0, nan, nan, nan, -19.0, -15746.0, nan, nan, -34.8231707317027, -319.0, nan], 'Aug-21': [nan, -8451.0, nan, -27973.0, nan, -675.0, -1423.0, nan, nan, nan, nan, -19.0, nan, -10480.0, nan, nan, -341.0, nan], 'Sep-21': [nan, nan, nan, -25934.0, nan, nan, nan, -5898.0, nan, nan, nan, -19.0, nan, nan, nan, nan, nan, -1591.0], 'Oct-21': [nan, nan, -4773.0, nan, nan, nan, -709.0, nan, nan, nan, nan, 
-19.0, nan, nan, nan, nan, nan, -1591.0]}
royalewithcheese
    Have you tried anything so far? – Muhammad Junaid Haris Feb 04 '21 at 08:56
  • Is it possible to explain the final step more? Why is `-2` removed for `ef` and `-10` for `ab`? – jezrael Feb 04 '21 at 09:28
  • What does `check the left-hand-side of the same row` mean? – jezrael Feb 04 '21 at 09:29
  • for final step, we start with col date_2 and compare the values with that of date_1. For ef we remove -2 because to the left, that is, values in col date_1 are negative and according to the condition mentioned, the value of the corresponding type B is removed and type A is kept. Hope this helps! – royalewithcheese Feb 04 '21 at 09:34
  • @MuhammadJunaidHaris I am actually struggling with the Final Step. I thought it would be better to explain the entire problem to see if someone has a better/efficient way to approach this problem :) – royalewithcheese Feb 04 '21 at 09:34
  • So it means `date_2, date_3`... are selected `A` or `B` by check `date_1` ? So if exist values there (in date_1) like `ef` then is removed `B` else removed `A` ? – jezrael Feb 04 '21 at 09:40
  • no..date_2 will be compared with date_1, date_3 will be compared with date_2 and so on. Please let me know if anything else is unclear. – royalewithcheese Feb 04 '21 at 09:43
  • Ok, so -10 removed, because no values in `date_2` for `ab`'? – jezrael Feb 04 '21 at 09:52
  • that is correct according to second point under final step.. – royalewithcheese Feb 04 '21 at 09:55
  • can you provide more details about your last steps? I have created a rough solution up till 2nd step but further details are little confusing. – k33da_the_bug Feb 04 '21 at 10:28
  • for the final step, if we take date_2 col, you will see that the values for ef's A and B (date_1) are both negative, and according to the mentioned condition, type B will be removed, that is, -2. Now, if we consider date_3 col, since the cells of date_2 for ab is blank, then -10 will be removed according to the mentioned condition. – royalewithcheese Feb 04 '21 at 11:23

2 Answers

import pandas as pd

# `d` is the test-data dict from the question, with the non-date columns
# renamed: column1 -> col1, column4 -> type, and the remaining descriptive
# columns to col2, subtype, col3, col4, col5
df = pd.DataFrame(d)

rem_cols = ['col2', 'subtype', 'col3', 'col4', 'col5']
df['g'] = df.groupby(['col1', 'type']).cumcount()

df1 = df.drop(rem_cols, axis=1)

df1 = df1.set_index(['col1','g', 'type'])


df1.columns = pd.to_datetime(df1.columns)

first_date = df1.columns[0]
df1 = df1.unstack(-1)

# print (df1.stack(dropna=False).reset_index())

df1 = df1.mask(df1.ge(0))

m1 = (df1.xs('A', level=1, axis=1, drop_level=False).notna() & 
      df1.xs('B', level=1, axis=1, drop_level=False).rename(columns={'B':'A'}, level=1).isna())
m2 = (df1.xs('B', level=1, axis=1, drop_level=False).notna() &
      df1.xs('A', level=1, axis=1, drop_level=False).rename(columns={'A':'B'}, level=1).isna())

m = m1.join(m2)

df1 = df1.mask(m)
# print (df1)

df2 = df1.groupby(level=1, axis=1).shift(1, axis=1)
# print (df2)

mask1 = df1.notna() & df2.isna() & (df1.columns.get_level_values(1) == 'A')[ None, :]
#avoid change values for date_1
mask1[first_date] = False

mask2 = df1.notna() & df2.notna() & (df1.columns.get_level_values(1) == 'B')[ None, :]

df1 = df1.mask(mask1).mask(mask2).stack(dropna=False)
# print (df1)

df = df[rem_cols + ['col1','g', 'type']].join(df1, on=['col1','g', 'type'])
print (df)    
jezrael

An alternative using numpy and a truth table. The logic for step 3 is a bit write-once, read-never. I added another test case, gh.

import io
import pandas as pd
import numpy as np
df = pd.read_csv(io.StringIO("""col1   | type   | date_1 | date_2 | date_3 
ab     |   A    |  -10   |    --    |  -10
ab     |   B    |  100   |   99   |  -12
cd     |   A    |   0    |  -25   |   6
cd     |   B    |  -1    |   8    |  -34
ef     |   A    |  -98   |  -9    |   0
ef     |   B    |  -7    |  -2    |   0
gh     |   A    |  -22   |  0    |   -3
gh     |   B    |  -75   |  0    |   -1

"""), sep=r"\s+\|\s+", engine="python").replace("--", "", regex=True)

# reshape dataframe so it's in better structure for steps
dfs = df.set_index(["col1","type"]).unstack(1)

# step 1,  identify all -ve values
a = dfs.replace("", 0).astype(int).lt(0).values
# step 2, keep where negative for type A & B,  note 2 relates to len(["A","B"])
ap = a.reshape((a.shape[0]*a.shape[1])//2, 2).all(axis=1)
a = np.repeat(ap,2).reshape(a.shape)

# step 3
# B are odd columns...
bcol = [bcol for bcol in range(a.shape[1]) if bcol%2==1]
# if A&B are both -ve for first date,  remove B value for other dates
# logic: a. A&B both -ve: a[:,[0,1]].all(axis=1),2)
#        b. logical and(not(A&B -ve), (B -ve))
a[:,bcol[1:]] = a[:,bcol[1:]]&np.repeat(~a[:,[0,1]].all(axis=1),2).reshape(len(dfs),2)

# rebuild df with truth array built for step 1&2&3
dfs.loc[:] = np.where(a, dfs, "")

# back to original shape...
df = dfs.stack().reset_index()

output

  col1 type date_1 date_2 date_3
0   ab    A                  -10
1   ab    B                  -12
2   cd    A                     
3   cd    B                     
4   ef    A    -98     -9       
5   ef    B     -7              
6   gh    A    -22            -3
7   gh    B    -75              

blank and negative A

  • replace step 1 with a more sophisticated binary logic
# col A, blanks & negative
a = (((dfs.columns.get_level_values(1)=="A") 
 & dfs.replace("",0).astype(int).le(0)) |
# col B, only negative
((dfs.columns.get_level_values(1)=="B") 
 & dfs.replace("",0).astype(int).lt(0))).values

Rob Raymond