I wish to remove a subset of data from my original dataframe.
Subset data: Mismatch_test_final: 141 columns, 14222 rows
Main data: X_TNR_final: 140 columns, 132252 rows
Example of what I want to achieve:
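import pandas as pd

# Toy stand-ins: df1 plays the role of the main data, df2 the subset to remove
df1 = pd.DataFrame({'k': ['foo', 'bar', 'baz', 'foo'],
                    'value': [1, 2, 3, 5]})
df2 = pd.DataFrame({'k': ['foo'],
                    'value': [5]})
leftover = df1.merge(df2, how='left', indicator=True)
# Keep only the rows that have no match in df2, then drop the helper column
answer = leftover.loc[leftover['_merge'] == 'left_only'].drop(columns='_merge')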
Expected output:
df1:
k value
foo 1
bar 2
baz 3
foo 5
df2:
k value
foo 5
answer:
k value
foo 1
bar 2
baz 3
I have referred to other threads, such as "How to remove a subset of a data frame in Python?", but the approaches described there are not working for me.
Approach 1:
I remove the one extra column (TPR_1) from the subset and use pandas merge with indicator=True:
remaining_TNR_Test = Test_TNR_final.merge(Mismatch_test_final.drop(['TPR_1'],axis=1), how='outer',indicator=True)
remaining_TNR_Test_final = remaining_TNR_Test[remaining_TNR_Test['_merge']=='left_only']
The output has more rows than expected, which indicates that the removal did not happen correctly.
Actual output: 127794 rows, 140 columns
Expected output: 118030 rows (132252-14222), 140 columns
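A note on the arithmetic: the expected 118030 assumes that each of the 14222 subset rows matches exactly one row in the main data. Since 127794 rows come back as left_only, only 4458 (132252 - 127794) rows of the main data found an exact match on all shared columns. Below is a minimal sketch of how that count could be checked on toy data. The helper name count_matches and the toy values are hypothetical, and it assumes the frames share column names and that exact equality (including float values and dtypes) is the intended matching rule.

```python
import pandas as pd


def count_matches(main: pd.DataFrame, subset: pd.DataFrame) -> dict:
    """Count how many rows of `main` do or do not have an exact match in `subset`.

    Hypothetical helper for illustration only; matching is on every column
    the two frames have in common.
    """
    shared = [c for c in main.columns if c in subset.columns]
    # De-duplicate the subset keys so a left merge cannot multiply rows of `main`
    keys = subset[shared].drop_duplicates()
    merged = main.merge(keys, on=shared, how='left', indicator=True)
    status = merged['_merge']
    return {
        'left_only': int((status == 'left_only').sum()),
        'both': int((status == 'both').sum()),
    }


# Toy data: the subset has an extra column and a duplicated row
main = pd.DataFrame({'k': ['foo', 'bar', 'baz', 'foo'], 'value': [1, 2, 3, 5]})
subset = pd.DataFrame({'k': ['foo', 'foo'], 'value': [5, 5], 'TPR_1': [0.1, 0.2]})

counts = count_matches(main, subset)
print(counts)
```

If 'both' is smaller than the number of subset rows on the real data, then some subset rows are duplicates or do not match any main row exactly, and the subtraction 132252 - 14222 would not be the right expectation.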
Approach 2: I also tried using the isin method:
remaining_TNR_Test_dummy=Test_TNR_final[~(Test_TNR_final.isin(Mismatch_test_final.drop(['TPR_1'],axis=1)).all(axis=1))]
With this technique, the number of rows remains unchanged; that is, no rows are removed.
Actual output: 132252 rows, 140 columns
Expected output: 118030 rows (132252-14222), 140 columns
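One possible reason, shown as a small sketch on the toy frames from the example at the top: when isin is given a DataFrame, pandas compares cell by cell at the same index and column labels, so it is not a "does this row appear anywhere in the other frame" test. The variable names aligned, full_match and kept are only for illustration, and whether this is the whole explanation for the real data is an assumption.

```python
import pandas as pd

df1 = pd.DataFrame({'k': ['foo', 'bar', 'baz', 'foo'], 'value': [1, 2, 3, 5]})
df2 = pd.DataFrame({'k': ['foo'], 'value': [5]})

# isin with a DataFrame argument requires both the index label and the
# column label to match before the cell values are compared.
aligned = df1.isin(df2)

# Row 3 of df1 is ('foo', 5), the same values as df2, but df2 only has
# index 0, so row 3 is never compared and is not flagged.
full_match = aligned.all(axis=1)

kept = df1[~full_match]
print(aligned)
print(len(kept))
```

Here no row is flagged as a full match, so nothing is removed, which is the same symptom as in Approach 2.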
Can someone please help me with this? Highly appreciate it! Thanks