I wish to remove a subset of data from my original dataframe.

Subset data: Mismatch_test_final: 141 columns, 14222 rows    
Main data: X_TNR_final: 140 columns, 132252 rows

Example of what I want to achieve:

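import pandas as pd

# df1 stands in for X_TNR_final, df2 for Mismatch_test_final
df1 = pd.DataFrame({'k': ['foo', 'bar', 'baz', 'foo'],
                    'value': [1, 2, 3, 5]})
df2 = pd.DataFrame({'k': ['foo'],
                    'value': [5]})

# Left merge with an indicator column, then keep the rows found only in df1
leftover = df1.merge(df2, how='left', indicator=True)
answer = leftover.loc[leftover['_merge'] == 'left_only'].drop(columns='_merge')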

Expected output

df1:
k   value
foo  1
bar  2
baz  3
foo  5

df2:
k   value
foo   5

answer:
k   value
foo  1
bar  2
baz  3

I have referred to other threads such as "How to remove a subset of a data frame in Python?", but the approaches suggested there are not working for me.

Approach 1:

I drop the one extra column (TPR_1) from the subset and use pandas merge with indicator=True:


remaining_TNR_Test = Test_TNR_final.merge(Mismatch_test_final.drop(['TPR_1'],axis=1), how='outer',indicator=True)
remaining_TNR_Test_final = remaining_TNR_Test[remaining_TNR_Test['_merge']=='left_only']

The output has more rows than expected, which indicates that the removal did not happen correctly.

Actual output: 127794 rows, 140 columns  
Expected output: 118030 rows (132252-14222), 140 columns
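
One way to narrow this down is to count how many subset rows actually find an exact match in the main frame. The sketch below uses toy data and hypothetical names (`main`, `subset`, `count_matches`). It assumes the shortfall comes from float values that differ slightly between the two frames, which is only one possible cause; differing dtypes or subset rows that are simply absent from the main frame would produce the same symptom.

```python
import pandas as pd

# Toy stand-ins for Test_TNR_final and Mismatch_test_final (hypothetical data).
main = pd.DataFrame({"k": ["foo", "bar", "baz", "foo"],
                     "value": [1.0, 2.0, 0.1 + 0.2, 5.0]})
subset = pd.DataFrame({"k": ["baz", "foo"],
                       "value": [0.3, 5.0]})  # 0.3 != 0.1 + 0.2 in floating point


def count_matches(main, subset):
    """Return how many distinct rows of `subset` have an exact match in `main`."""
    merged = subset.drop_duplicates().merge(
        main.drop_duplicates(), how="left", indicator=True
    )
    return int((merged["_merge"] == "both").sum())


# Exact comparison: the 'baz' row fails to match because of float noise.
matched_raw = count_matches(main, subset)

# Rounding both sides first (an assumption about the cause, not a general fix).
matched_rounded = count_matches(main.round(6), subset.round(6))

# Columns whose dtypes differ between the frames are also worth inspecting.
mismatched_cols = [c for c in main.columns if main[c].dtype != subset[c].dtype]

print(matched_raw, matched_rounded, mismatched_cols)
```

If the number of matched rows is smaller than the number of rows in the subset, the merge cannot remove all of them, whatever value of `how` is used.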

Approach 2: I also tried using the 'isin' method

remaining_TNR_Test_dummy=Test_TNR_final[~(Test_TNR_final.isin(Mismatch_test_final.drop(['TPR_1'],axis=1)).all(axis=1))]

With this technique the number of rows remains unchanged; that is, no rows are removed.

Actual output: 132252 rows, 140 columns  
Expected output: 118030 rows (132252-14222), 140 columns
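
A possible explanation for this: when `DataFrame.isin` is given another DataFrame, it compares values label by label, so a cell only counts as a match when both its index label and its column label line up. If the subset carries a different index than the main frame, no row matches in full and nothing is dropped. The sketch below illustrates this with toy data and shows one row-wise alternative built on a `MultiIndex`; the names `main`, `subset`, `aligned`, and `mask` are hypothetical.

```python
import pandas as pd

main = pd.DataFrame({"k": ["foo", "bar", "baz", "foo"],
                     "value": [1, 2, 3, 5]})
# The subset row sits at index 0, but its twin in `main` sits at index 3.
subset = pd.DataFrame({"k": ["foo"], "value": [5]})

# Label-aligned comparison: only main row 0 is compared with subset row 0,
# and their 'value' cells differ, so no row matches in every column.
aligned = main.isin(subset).all(axis=1)
unchanged = main[~aligned]

# Row-wise comparison that ignores the index: treat each row as a tuple.
cols = list(main.columns)
mask = pd.MultiIndex.from_frame(main[cols]).isin(
    pd.MultiIndex.from_frame(subset[cols])
)
remaining = main[~mask]

print(len(unchanged), len(remaining))
```

Like the merge approach, this row-wise check relies on exact equality, so it is subject to the same float and dtype caveats.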

Can someone please help me with this? Highly appreciate it! Thanks

  • 2
    Please include a small sample of your dataframes along with your desired results. Take a look at [how-to-make-good-reproducible-pandas-examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples). – Shubham Sharma Jun 25 '20 at 12:56
  • 1
    Thank you, I have added a small sample and desired results. hope it's alright, I'm new here – Sakshi Jajodia Jun 25 '20 at 13:13
  • 1
    I guess using `leftover= df1.merge(df2,how='left',indicator=True)` should work for you. What problem are you facing using this technique? – Shubham Sharma Jun 25 '20 at 13:26
  • 1
    Cannot reproduce error. Your example code produces the expected output. Probably the problem is in the data not in the code. – above_c_level Jun 26 '20 at 06:33

0 Answers