I have a data frame like this:

df
col1    col2    col3
 1        A       B
 1        D       R
 2        R       P
 2        D       F
 3        T       G
 1        R       S
 3        R       S

I want to keep only the rows up to and including the first 3 unique values of col1. If a col1 value appears again later in the df, those later rows should be ignored.

The final data frame should look like:

df
col1    col2    col3
 1        A       B
 1        D       R
 2        R       P
 2        D       F
 3        T       G

What is the most efficient way to do this in pandas?

Kallol
  • @jezrael I want to keep the first three unique col1 values; drop_duplicates() doesn't solve this. If it's a duplicate, please give me the link. – Kallol Mar 22 '19 at 06:43
  • 2
    This question is different from the drop duplicates one linked. – Nathaniel Mar 22 '19 at 06:43

3 Answers

1

Create a helper Series that numbers the consecutive groups of col1 with Series.ne, Series.shift and Series.cumsum, then filter with boolean indexing:

N = 3
df = df[df.col1.ne(df.col1.shift()).cumsum() <= N]
print (df)
   col1 col2 col3
0     1    A    B
1     1    D    R
2     2    R    P
3     2    D    F
4     3    T    G

Detail:

print (df.col1.ne(df.col1.shift()).cumsum())
0    1
1    1
2    2
3    2
4    3
5    4
6    5
Name: col1, dtype: int32
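
To try the idea end to end, here is a minimal self-contained sketch. The function name `first_n_runs` is hypothetical, and the sample frame is rebuilt by hand from the question. Note that the helper Series counts consecutive runs of col1 rather than distinct values, so it matches the expected output only when each value first appears as one contiguous block, as it does in the sample.

```python
import pandas as pd


def first_n_runs(df, column, n):
    """Keep the rows that belong to the first n consecutive runs of `column`.

    Hypothetical helper: the name and signature are illustrative only.
    """
    # A new run starts wherever the value differs from the previous row.
    run_id = df[column].ne(df[column].shift()).cumsum()
    return df[run_id <= n]


# Sample data rebuilt from the question.
df = pd.DataFrame({
    "col1": [1, 1, 2, 2, 3, 1, 3],
    "col2": list("ADRDTRR"),
    "col3": list("BRPFGSS"),
})

result = first_n_runs(df, "col1", 3)
print(result)
```

Wrapping the mask in a function leaves the original `df` untouched, which can be convenient when comparing several values of `n`.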
jezrael
1

Here is a solution that stops as soon as the first three distinct values have been found:

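from io import StringIO

import pandas as pd

data = """
col1    col2    col3
 1        A       B
 1        D       R
 2        R       P
 2        D       F
 3        T       G
 1        R       S
 3        R       S
 """
# pd.compat.StringIO was removed from recent pandas versions; use io.StringIO instead
df = pd.read_csv(StringIO(data), sep=r'\s+')
nbr = 3
dico = {}  # distinct col1 values seen so far
for index, row in df.iterrows():
    dico[row.col1] = True
    if len(dico) == nbr:
        # index is the default 0-based row label here, so it is also the position
        df = df.iloc[:index + 1]
        break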

print(df)

   col1 col2 col3
0     1    A    B
1     1    D    R
2     2    R    P
3     2    D    F
4     3    T    G
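
The same stopping rule can also be sketched without a Python-level loop. This is an alternative under stated assumptions, not the approach used above: the function name `head_until_n_distinct` is hypothetical, `n` is assumed to be at least 1, and the frame is cut by position, so it does not rely on a default index.

```python
import numpy as np
import pandas as pd


def head_until_n_distinct(df, column, n):
    """Keep rows up to and including the first row holding the n-th distinct value.

    Hypothetical helper: the name and signature are illustrative only.
    """
    # True on the first occurrence of each value in `column`.
    is_first = ~df[column].duplicated()
    # Running count of distinct values seen so far, row by row.
    distinct_so_far = is_first.cumsum().to_numpy()
    hits = np.flatnonzero(distinct_so_far >= n)
    if hits.size == 0:
        # Fewer than n distinct values: nothing to cut.
        return df
    return df.iloc[: hits[0] + 1]


# Sample data rebuilt from the question.
df = pd.DataFrame({
    "col1": [1, 1, 2, 2, 3, 1, 3],
    "col2": list("ADRDTRR"),
    "col3": list("BRPFGSS"),
})

result = head_until_n_distinct(df, "col1", 3)
print(result)
```

Like the loop, this stops at the first row of the n-th distinct value, so any further consecutive rows with that same value would be dropped.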
jezrael
Frenchy
1

You can use the duplicated method in pandas:

mask1 = df.duplicated(keep="first")  # True for each repeat of a row after its first occurrence
mask2 = df.duplicated(keep=False)    # True for every row that occurs more than once
mask = ~mask1 | ~mask2               # keep first occurrences and rows that occur only once
df[mask]
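
A quick way to see what this mask selects is to run it on the question's sample frame, rebuilt here by hand (an assumption about the exact dtypes). `DataFrame.duplicated` compares whole rows by default, and no full row repeats in the sample, so this sketch should report that every row is kept; the `subset` parameter would be needed to compare on col1 alone.

```python
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({
    "col1": [1, 1, 2, 2, 3, 1, 3],
    "col2": list("ADRDTRR"),
    "col3": list("BRPFGSS"),
})

mask1 = df.duplicated(keep="first")  # repeats of a full row after its first occurrence
mask2 = df.duplicated(keep=False)    # every full row that occurs more than once
mask = ~mask1 | ~mask2

kept = df[mask]
print(mask.all())  # whether every row passes the mask
print(len(kept))   # number of rows kept
```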
Tomerikoo