How to get the rows based on unique column values of their first occurrence

Question

I have a data frame like this:

df
col1    col2    col3
 1        A       B
 1        D       R
 2        R       P
 2        D       F
 3        T       G
 1        R       S
 3        R       S

I want to keep only the rows that belong to the first 3 unique values of col1. If a col1 value shows up again later in the df, those later rows should be ignored.

The final data frame should look like:

df
col1    col2    col3
 1        A       B
 1        D       R
 2        R       P
 2        D       F
 3        T       G

What is the most efficient way to do this in pandas?

@jezrael I want to keep the first three unique col1 values; drop_duplicates() doesn't give me a solution. If it's a duplicate, please give me the link. – Kallol, Mar 22 '19 at 06:43
This question is different from the drop duplicates one linked. – Nathaniel, Mar 22 '19 at 06:43

score 1 · Accepted Answer · answered Mar 22 '19 at 06:44

Create a helper Series that numbers the consecutive groups of col1 with Series.ne, Series.shift and Series.cumsum, then filter with boolean indexing:

N = 3
df = df[df.col1.ne(df.col1.shift()).cumsum() <= N]
print (df)
   col1 col2 col3
0     1    A    B
1     1    D    R
2     2    R    P
3     2    D    F
4     3    T    G

Detail:

print (df.col1.ne(df.col1.shift()).cumsum())
0    1
1    1
2    2
3    2
4    3
5    4
6    5
Name: col1, dtype: int32
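
For anyone who wants to run this end to end, here is a minimal sketch that rebuilds the question's frame by hand (the column dtypes are an assumption) and names each intermediate step. The names `starts_new_run`, `run_id` and `first_runs` are only illustrative and do not come from the answer.

```python
import pandas as pd

# Sample frame from the question, rebuilt by hand
df = pd.DataFrame({
    "col1": [1, 1, 2, 2, 3, 1, 3],
    "col2": ["A", "D", "R", "D", "T", "R", "R"],
    "col3": ["B", "R", "P", "F", "G", "S", "S"],
})

N = 3

# True wherever col1 differs from the previous row; the first row is
# always True because shift() leaves NaN in that position
starts_new_run = df["col1"].ne(df["col1"].shift())

# Running count of runs of equal values: 1, 1, 2, 2, 3, 4, 5
run_id = starts_new_run.cumsum()

# Keep only the rows that belong to the first N runs
first_runs = df[run_id <= N]
print(first_runs)
```

Note that this counts runs of consecutive equal values rather than distinct values, so a column such as 1, 2, 1, 3 would keep the second 1 as its third run.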

score 1 · Answer 2 · edited Mar 22 '19 at 07:25

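Here is a solution that stops as soon as it has found the first three distinct values:

import io
import pandas as pd

data = """
col1    col2    col3
 1        A       B
 1        D       R
 2        R       P
 2        D       F
 3        T       G
 1        R       S
 3        R       S
"""
# io.StringIO replaces pd.compat.StringIO, which is not available in current pandas
df = pd.read_csv(io.StringIO(data), sep=r'\s+')
nbr = 3
dico = {}
for index, row in df.iterrows():
    dico[row.col1] = True          # record every col1 value seen so far
    if len(dico) == nbr:
        # index equals the row position here because read_csv gives a default RangeIndex
        df = df.iloc[:index + 1]
        break

print(df)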

  col1 col2 col3
0     1    A    B
1     1    D    R
2     2    R    P
3     2    D    F
4     3    T    G
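
One possible variation on the loop above, offered only as a sketch: it walks over the values of a single column instead of whole rows and cuts by position, so it should also work when the index is not the default 0..n-1. The helper name `head_until_n_unique` and its parameters are hypothetical, and the sample frame is rebuilt by hand.

```python
import pandas as pd


def head_until_n_unique(frame, column, n):
    """Return the leading rows of `frame`, up to and including the row
    where the n-th distinct value of `column` first appears."""
    seen = set()
    for pos, value in enumerate(frame[column].to_numpy()):
        seen.add(value)
        if len(seen) == n:
            # Positional slice, so the index labels do not matter
            return frame.iloc[:pos + 1]
    # Fewer than n distinct values: there is nothing to cut off
    return frame


df = pd.DataFrame({
    "col1": [1, 1, 2, 2, 3, 1, 3],
    "col2": ["A", "D", "R", "D", "T", "R", "R"],
    "col3": ["B", "R", "P", "F", "G", "S", "S"],
})

result = head_until_n_unique(df, "col1", 3)
print(result)
```

Like the loop above, this cuts at the first row of the third distinct value, so if that value repeated on the following rows those repeats would not be kept.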

answered Mar 22 '19 at 07:11 by Frenchy · edited Mar 22 '19 at 07:25 by jezrael

@jezrael I'm not talking about execution time, but fast in the sense of stopping once the solution is found. Sorry for my English. – Frenchy Mar 22 '19 at 07:23
Because the OP's last sentence is `How to do it most efficient way in pandas ?` :) – jezrael Mar 22 '19 at 07:26

score 1 · Answer 3 · edited Mar 29 '21 at 13:53

You can use the duplicated method in pandas:

mask1 = df.duplicated(keep="first")  # True for every repeat of a row after its first occurrence
mask2 = df.duplicated(keep=False)    # True for every row that has an identical copy elsewhere
mask =  ~mask1 | ~mask2
df[mask]
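
To see what these masks select, here is a small sketch on the question's sample frame (rebuilt by hand, which is an assumption about the dtypes). Called without `subset`, `duplicated()` compares whole rows, and no full row repeats in the sample, so every row passes the mask. The names `first_seen` and `occurs_once` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "col1": [1, 1, 2, 2, 3, 1, 3],
    "col2": ["A", "D", "R", "D", "T", "R", "R"],
    "col3": ["B", "R", "P", "F", "G", "S", "S"],
})

first_seen = ~df.duplicated(keep="first")  # True for the first copy of each full row
occurs_once = ~df.duplicated(keep=False)   # True for rows with no identical copy
mask = first_seen | occurs_once

kept = df[mask]
print(len(kept), "of", len(df), "rows kept")
```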

answered Mar 29 '21 at 13:50 by seghair tarek · edited Mar 29 '21 at 13:53 by Tomerikoo
