I have the DataFrame below.

A   B   C   D   E   F   G
1   4   9   4   6   9   8
2   2   2   2   2   5   9
2   2   2   2   2   2   2
2   6   9   5   4   4   5
2   8   1   9   5   8   9
2   2   2   5   6   3   6

I need the output below:

A   B   C   D   E   F   G
1   4   9   4   6   9   8
2   6   9   5   4   4   5
2   8   1   9   5   8   9
2   2   2   5   6   3   6

In other words, any row in which more than three columns hold the same value should be deleted. The second and third rows have 5 and 7 columns with the same value, respectively, so those rows need to be deleted.

Could anyone please help me?

jpp
PANDA

3 Answers

Here's a naïve Pandas loop via pd.DataFrame.apply and pd.Series.value_counts:

def max_count(s):
    return s.value_counts().values[0]

res = df[df.apply(max_count, axis=1).le(3)]

print(res)

   A  B  C  D  E  F  G
0  1  4  9  4  6  9  8
3  2  6  9  5  4  4  5
4  2  8  1  9  5  8  9
5  2  2  2  5  6  3  6
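
For anyone who wants to reproduce this, here is a minimal self-contained sketch. It assumes the question's table is loaded as an integer DataFrame with columns A to G; the intermediate name row_max is only illustrative and not part of the answer above.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame(
    [[1, 4, 9, 4, 6, 9, 8],
     [2, 2, 2, 2, 2, 5, 9],
     [2, 2, 2, 2, 2, 2, 2],
     [2, 6, 9, 5, 4, 4, 5],
     [2, 8, 1, 9, 5, 8, 9],
     [2, 2, 2, 5, 6, 3, 6]],
    columns=list("ABCDEFG"),
)

def max_count(s):
    # value_counts sorts by frequency, so the first entry is the largest count
    return s.value_counts().iloc[0]

# Largest number of repeats of any single value, per row
row_max = df.apply(max_count, axis=1)

# Keep rows where no value appears more than three times
res = df[row_max.le(3)]
print(res)
```

Printing row_max before filtering is a quick way to check which rows exceed the threshold.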
jpp

Approach #1

For a dataframe of non-negative ints (np.bincount does not accept negative values), here's a vectorized one with bincount -

import numpy as np

# https://stackoverflow.com/a/46256361/ @Divakar
def bincount2D_vectorized(a):
    N = a.max()+1
    # Shift each row into its own range of N bins
    a_offs = a + np.arange(a.shape[0])[:,None]*N
    return np.bincount(a_offs.ravel(), minlength=a.shape[0]*N).reshape(-1,N)

out = df[(bincount2D_vectorized(df.values)<=3).all(1)]

Sample output -

In [563]: df[(bincount2D_vectorized(df.values)<=3).all(1)]
Out[563]: 
   A  B  C  D  E  F  G
0  1  4  9  4  6  9  8
3  2  6  9  5  4  4  5
4  2  8  1  9  5  8  9
5  2  2  2  5  6  3  6
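
To see why the offset works, here is a small sketch on a made-up 2x3 array (the array a and the intermediate names are illustrative only). Each row is shifted by row_index * N so that its values fall into a separate block of N bins, and a single np.bincount call then counts every row at once.

```python
import numpy as np

# Tiny illustrative input: two rows, values in 0..2
a = np.array([[0, 1, 1],
              [2, 2, 2]])

N = a.max() + 1                               # number of bins per row (3)
offsets = np.arange(a.shape[0])[:, None] * N  # [[0], [3]]
a_offs = a + offsets                          # [[0, 1, 1], [5, 5, 5]]

# One flat bincount, then reshape back to (rows, N)
counts = np.bincount(a_offs.ravel(), minlength=a.shape[0] * N).reshape(-1, N)
print(counts)
# [[1 2 0]
#  [0 0 3]]

# A row passes the filter only if every per-value count is at most 3
keep = (counts <= 3).all(1)
```

Row i, column v of counts holds how many times value v appears in row i, which is why the answer can test (counts <= 3).all(1).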
Divakar

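A set keeps only unique values, so comparing len(set(row)) with len(row) looks like an easy test. It counts every repeated value, though, not just the most frequent one: the last sample row (2 2 2 5 6 3 6) has only four unique values and would be removed even though no value appears more than three times. Counting the most frequent value in each row avoids this. Iterate over the dataframe to find those rows and store their indexes.

from collections import Counter

indexes_to_remove = []
for index, row in df.iterrows():
    # Count of the most frequent value in this row
    if Counter(row).most_common(1)[0][1] > 3:
        indexes_to_remove.append(index)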

Then you can remove them safely.
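
A minimal sketch of the removal step, assuming indexes_to_remove holds row labels collected by the loop. The small frame and the hard-coded list here are placeholders for illustration, not the question's data.

```python
import pandas as pd

# Placeholder frame: rows 1 and 2 repeat the value 2 four times
df = pd.DataFrame({"A": [1, 2, 2],
                   "B": [4, 2, 2],
                   "C": [9, 2, 2],
                   "D": [4, 2, 2]})

# Placeholder for the labels gathered by the loop
indexes_to_remove = [1, 2]

# drop returns a new frame; the original df is left unchanged
res = df.drop(indexes_to_remove)
print(res)
```

Because drop works on index labels, this stays correct even if the index is not a simple 0..n-1 range.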

shnorkel