How to delete rows having same value in more than 3 columns

Question

I have the DataFrame below.

A   B   C   D   E   F   G
1   4   9   4   6   9   8
2   2   2   2   2   5   9
2   2   2   2   2   2   2
2   6   9   5   4   4   5
2   8   1   9   5   8   9
2   2   2   5   6   3   6
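
For anyone who wants to reproduce this, here is a minimal sketch that rebuilds the frame above. The variable name `df` is an assumption, chosen to match the name used in the answers.

```python
import pandas as pd

# Sample data copied from the table above; the name `df` is an assumption
df = pd.DataFrame(
    [
        [1, 4, 9, 4, 6, 9, 8],
        [2, 2, 2, 2, 2, 5, 9],
        [2, 2, 2, 2, 2, 2, 2],
        [2, 6, 9, 5, 4, 4, 5],
        [2, 8, 1, 9, 5, 8, 9],
        [2, 2, 2, 5, 6, 3, 6],
    ],
    columns=list("ABCDEFG"),
)
print(df)
```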

I need output as below:

A   B   C   D   E   F   G
1   4   9   4   6   9   8
2   6   9   5   4   4   5
2   8   1   9   5   8   9
2   2   2   5   6   3   6

That is, rows in which more than three columns hold the same value should be deleted. The second and third rows have 5 and 7 columns with the same value, respectively, so those rows need to be deleted.

Could anyone please help me?

`same value in more than 3 columns` - in sequence or any order? — Divakar, Oct 08 '18 at 12:08
Question has nothing to do with `numpy` or `machine-learning` - kindly do not spam the tags (removed). — desertnaut, Oct 08 '18 at 12:09
@desertnaut Well pandas dataframes have NumPy array as the underlying data. So NumPy might be relevant. Also, for performance it's useful. — Divakar, Oct 08 '18 at 12:11

score 2 · Answer 1 · answered Oct 08 '18 at 12:26

Here's a naïve Pandas loop via pd.DataFrame.apply and pd.Series.value_counts:

def max_count(s):
    return s.value_counts().values[0]

res = df[df.apply(max_count, axis=1).le(3)]

print(res)

   A  B  C  D  E  F  G
0  1  4  9  4  6  9  8
3  2  6  9  5  4  4  5
4  2  8  1  9  5  8  9
5  2  2  2  5  6  3  6
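
To see why `max_count` works: `value_counts` sorts its counts in descending order by default, so the first entry is the count of the most frequent value. Here is a small sketch on two rows from the question. The Series names are illustrative only.

```python
import pandas as pd

row_kept = pd.Series([2, 2, 2, 5, 6, 3, 6])     # last row of the sample data
row_dropped = pd.Series([2, 2, 2, 2, 2, 5, 9])  # second row of the sample data

def max_count(s):
    # value_counts sorts counts in descending order by default,
    # so the first entry belongs to the most frequent value
    return s.value_counts().values[0]

print(max_count(row_kept))     # 3, passes the `<= 3` filter
print(max_count(row_dropped))  # 5, filtered out
```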

score 1 · Answer 2 · Divakar · 2018-10-08T12:34:45.150

Approach #1

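For a dataframe of non-negative ints, here's a vectorized approach with np.bincount:

import numpy as np

# https://stackoverflow.com/a/46256361/ @Divakar
def bincount2D_vectorized(a):
    # Number of bins per row: one for each value from 0 to a.max()
    N = a.max()+1
    # Shift each row into its own block of N bins
    a_offs = a + np.arange(a.shape[0])[:,None]*N
    # One bincount over all rows, then back to one row of counts per input row
    return np.bincount(a_offs.ravel(), minlength=a.shape[0]*N).reshape(-1,N)

# Keep the rows in which no value occurs more than 3 times
out = df[(bincount2D_vectorized(df.values)<=3).all(1)]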

Sample output -

In [563]: df[(bincount2D_vectorized(df.values)<=3).all(1)]
Out[563]: 
   A  B  C  D  E  F  G
0  1  4  9  4  6  9  8
3  2  6  9  5  4  4  5
4  2  8  1  9  5  8  9
5  2  2  2  5  6  3  6
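
The offset trick may be easier to follow on a tiny array. Each row is shifted into its own block of `N` bins, so a single `np.bincount` call counts every row at once. Below is a sketch of the same steps on a made-up 2x4 array (the array is not from the question).

```python
import numpy as np

a = np.array([[0, 1, 1, 1],
              [2, 2, 2, 2]])  # made-up example, not from the question

N = a.max() + 1                                  # 3 bins per row, for values 0, 1, 2
a_offs = a + np.arange(a.shape[0])[:, None] * N  # second row is shifted by 3
counts = np.bincount(a_offs.ravel(), minlength=a.shape[0] * N).reshape(-1, N)

print(a_offs)                # rows [0 1 1 1] and [5 5 5 5]
print(counts)                # rows [1 3 0] and [0 0 4]
print((counts <= 3).all(1))  # [ True False]: only the first row is kept
```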

edited Oct 08 '18 at 12:34

answered Oct 08 '18 at 12:28

Divakar

Is there an approach #2? I'm interested :) – jpp Oct 08 '18 at 12:43

score 0 · Answer 3 · answered Oct 08 '18 at 12:16

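You can count how often each value occurs in a row with collections.Counter. Comparing len(set(row)) with len(row) is not enough, because a set only tells you how many distinct values there are: the last sample row (2 2 2 5 6 3 6) has just 4 distinct values, yet no value occurs more than 3 times. Iterate over the dataframe to find the rows where some value occurs more than 3 times and store their indexes.

from collections import Counter

indexes_to_remove = []
for index, row in df.iterrows():
    # Counter maps each value in the row to its number of occurrences
    if max(Counter(row).values()) > 3:
        indexes_to_remove.append(index)

Then you can remove them safely with df.drop(indexes_to_remove).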
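
Removing the collected labels could look like the sketch below. It assumes the sample frame from the question, and the list is filled in by hand with labels 1 and 2, the two rows the question wants removed. `DataFrame.drop` returns a new frame and leaves `df` unchanged.

```python
import pandas as pd

# Sample data from the question; the name `df` is an assumption
df = pd.DataFrame(
    [
        [1, 4, 9, 4, 6, 9, 8],
        [2, 2, 2, 2, 2, 5, 9],
        [2, 2, 2, 2, 2, 2, 2],
        [2, 6, 9, 5, 4, 4, 5],
        [2, 8, 1, 9, 5, 8, 9],
        [2, 2, 2, 5, 6, 3, 6],
    ],
    columns=list("ABCDEFG"),
)

indexes_to_remove = [1, 2]  # filled in by hand for illustration

res = df.drop(indexes_to_remove)  # drops rows by index label, returns a new frame
print(res)
```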
