2

Given a pandas.DataFrame:

    | Col_A | Col_B | Other_Columns
0   | A0    | B0    | …
1   | A1    | B1    | …
2   | A2    | B2    | …
3   | A3    | B3    | …
…   | …     | …     | …

I have been trying to keep a subset of this DataFrame: getting rid of rows where both A and B entries are unique (e.g. if on row 6 say, both values A6 and B6 do not appear anywhere else on the DataFrame, we want to remove this row)

I would not like to drop duplicates. Also, I would not like to get unique values (which, if I understand well would be similar to converting a list to a set, am I right?) but instead the values that appear once only.

At this point, this is what I have got:

counts = df[['Col_A','Col_B']].stack().value_counts(ascending=True)
myList = [0] + [item for item in counts[counts.values == 1].index]
toRemove = []
for i in df.index:
    if (df.at[i,'Col_A'] and df.at[i, 'Col_B']) in myList:
        toRemove.append(i)
final_df = df[~df.index.isin(toRemove)]

This is not very efficient (the data frame is pretty large >10M lines) There must be a more pythonic strategy, involving builtin features of pandas, right? Also, I am not too sure the first line is correct: by stacking both columns, am I making sure to perform the count over the entries of both columns?

Please do not hesitate if you need more information or if my writing is not clear to you.

Many thanks for taking the time :-)

1 Answers1

1

what about something like this:

In [75]: df = pd.DataFrame(np.random.randint(0,100,size=(10, 3)), columns=list('abc'))

In [76]: df
Out[76]:
    a   b   c
0  37  85  17
1  19   0  11
2  51  20  65
3  59  92  65
4  48  15  91
5  21  50  44
6  61  94  49
7  51   6  88
8  89  72  40
9   5  51  79

In [77]: c = df[['a','b']].stack().value_counts()

In [78]: c
Out[78]:
51    3
94    1
15    1
37    1
6     1
72    1
50    1
21    1
5     1
48    1
61    1
19    1
20    1
85    1
89    1
59    1
92    1
0     1
dtype: int64

In [79]: c[c>1]
Out[79]:
51    3
dtype: int64

In [80]: vals = c[c>1].index

In [81]: df[(df['a'].isin(vals)) | (df['b'].isin(vals))]
Out[81]:
    a   b   c
2  51  20  65
7  51   6  88
9   5  51  79

UPDATE:

when you do the if (df.at[i,'Col_A'] and df.at[i, 'Col_B']) in myList: check you are checking it not quite correctly...

here is what is happening:

In [90]: df.at[0, 'a'], df.at[0, 'b']
Out[90]: (37, 85)

In [91]: (df.at[0, 'a'] and df.at[0, 'b'])
Out[91]: 85

So you can't check it this way

MaxU - stand with Ukraine
  • 205,989
  • 36
  • 386
  • 419
  • Many thanks MaxU, your suggested answer is clear and your code fast. I have been playing with it, yet one doubt remains: I get a completely different output if I try to do it with the slow approach of looping over data frame indices: `keepIndex = []` `for i in df.index: if df.at[i,’a’] or df.at[i,’b’] in vals: keepIndex.append(i)` `final_df = df[df.index.isin(keepIndex)]` Is my code doing a completely other thing? Am I missing something out here? – UniversalTraveller May 09 '16 at 15:14
  • @UniversalTraveller, please read "UPDATE:" section in my answer – MaxU - stand with Ukraine May 09 '16 at 15:25
  • Ok, many thanks for clarification :-) By any chance do you know why? – UniversalTraveller May 09 '16 at 15:37
  • 1
    @UniversalTraveller, glad to help. When you run `1 or 3`, python will check whether __at least__ one number is True (i.e. !=0 ), so it will return `1`. If you execute `1 and 3`, python will check whether __both__ numbers are True ( != 0 ), so it'll return the last one- `3`. This is happening when you do `if (df.at[i,’a’] or df.at[i,’b’])` - basically it'll be translated to `if ()`. I would suggest you to play with it: `print(3 or 5); print(3 and 5)` – MaxU - stand with Ukraine May 09 '16 at 15:41