
The title may not be super clear. What I want to do is the following.

I have the following dataframe:

import pandas as pd

df = pd.DataFrame(
    {
        "id": ["1", "2", "3", "1", "4", "5", "2", "6", "3", "1", "4"],
        "value": ["A", "A", "B", "B", "B", "C", "C", "A", "A", "D", "A"],
    },
    index=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
)

From this dataframe I'd like to create a new dataframe containing only the rows in which each value of the column "id" appears for the first time. That means the rows with the indices 0, 1, 2, 4, 5 and 7.

I hope the problem is expressed clearly enough. Thanks.

Bora
    Similar: [pandas group by and find first non null value for all columns](https://stackoverflow.com/q/59048308/15497888) – Henry Ecker Jul 05 '21 at 18:35

2 Answers


If you want to retain the original indices, as you mention, you can negate the result of Series.duplicated on id with ~ and use it as a boolean mask:

df[~df['id'].duplicated()]

  id value
0  1     A
1  2     A
2  3     B
4  4     B
5  5     C
7  6     A
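
The same selection can be sketched end to end as below. This is a minimal illustration that assumes the df from the question; the drop_duplicates call is an alternative spelling added here for comparison and is not part of the answer above.

```python
import pandas as pd

# The dataframe from the question.
df = pd.DataFrame(
    {
        "id": ["1", "2", "3", "1", "4", "5", "2", "6", "3", "1", "4"],
        "value": ["A", "A", "B", "B", "B", "C", "C", "A", "A", "D", "A"],
    },
    index=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
)

# duplicated() marks every repeat of an id as True (keep="first" is the default),
# so negating it leaves True only on the first occurrence of each id.
mask = ~df["id"].duplicated()
first_rows = df[mask]

# Alternative spelling (an addition for comparison): drop repeated ids directly.
alt = df.drop_duplicates(subset="id", keep="first")

print(first_rows.index.tolist())  # [0, 1, 2, 4, 5, 7]
```

Both forms keep the original index labels because they filter rows rather than rebuild the frame.
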
anky

Try:

print(df.groupby("id", as_index=False).first())

Prints:

  id value
0  1     A
1  2     A
2  3     B
3  4     B
4  5     C
5  6     A
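
One difference worth checking, sketched under the assumption that df is the frame from the question: groupby(...).first() takes the first non-null value per column within each group and returns a fresh 0..n-1 index, so the original labels 0, 1, 2, 4, 5 and 7 are not kept. The groupby(...).head(1) variant below is an addition for comparison, not part of the answer above.

```python
import pandas as pd

# The dataframe from the question.
df = pd.DataFrame(
    {
        "id": ["1", "2", "3", "1", "4", "5", "2", "6", "3", "1", "4"],
        "value": ["A", "A", "B", "B", "B", "C", "C", "A", "A", "D", "A"],
    },
    index=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
)

# first() aggregates per group: rows come back sorted by "id"
# and the index is reset to 0..5.
grouped = df.groupby("id", as_index=False).first()
print(grouped.index.tolist())  # [0, 1, 2, 3, 4, 5]

# head(1) filters instead of aggregating, so the original rows
# and their index labels are preserved.
kept = df.groupby("id").head(1)
print(kept.index.tolist())  # [0, 1, 2, 4, 5, 7]
```

If the original index matters, as it does in the question, the filtering form is the safer choice.
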
Andrej Kesely