Given a pandas.DataFrame
:
| Col_A | Col_B | Other_Columns
0 | A0 | B0 | …
1 | A1 | B1 | …
2 | A2 | B2 | …
3 | A3 | B3 | …
… | … | … | …
I have been trying to keep a subset of this DataFrame
: getting rid of rows where both A
and B
entries are unique (e.g. if on row 6 say, both values A6
and B6
do not appear anywhere else on the DataFrame
, we want to remove this row)
I would not like to drop duplicates. Also, I would not like to get unique values (which, if I understand well would be similar to converting a list to a set, am I right?) but instead the values that appear once only.
At this point, this is what I have got:
counts = df[['Col_A','Col_B']].stack().value_counts(ascending=True)
myList = [0] + [item for item in counts[counts.values == 1].index]
toRemove = []
for i in df.index:
if (df.at[i,'Col_A'] and df.at[i, 'Col_B']) in myList:
toRemove.append(i)
final_df = df[~df.index.isin(toRemove)]
This is not very efficient (the data frame is pretty large >10M lines) There must be a more pythonic strategy, involving builtin features of pandas, right? Also, I am not too sure the first line is correct: by stacking both columns, am I making sure to perform the count over the entries of both columns?
Please do not hesitate if you need more information or if my writing is not clear to you.
Many thanks for taking the time :-)