How do I keep the first entry from consecutive entries in a DataFrame?

Question

With the following DataFrame:

import pandas as pd
import random
random.seed(3)
df = pd.DataFrame(
    data=[random.sample(["A", "B"], 1) for _ in range(6)],
    columns=["category"],
)

We get:

How do I get only the first row for each consecutive category group?

Note: the data can contain an arbitrary number of repeats - I only want the first of each consecutive group.

Expected would be:

  category
0  A
2  B
4  A

I hoped that the sort flag of groupby() would solve this, but it still treats all occurrences of a category as one group rather than grouping only consecutive ones:

df.groupby("category").head(1)

As I am learning pandas and my DataFrame can become very large, I'm looking for a pandas-native solution rather than iterating over the array or DataFrame.

While the answers from "Make Pandas groupby act similarly to itertools groupby" can be applied here, the question posed there is different. I would therefore leave this question open so that an answer is easier to find.

Nope. It can be any number of consecutive entries. In the real data I'm trying to detect state-change events that are triggered by other fields as well. — Pascal, Feb 01 '23 at 10:10
This should work: `df[(df.ne(df.shift())).any(axis=1)]`. Since the question is closed, I cannot post it as an answer. — Lucas M. Uriarte, Feb 01 '23 at 10:13

score 1 · Accepted Answer · answered Feb 01 '23 at 10:09

This is probably the best solution:

(df
    .assign(equal_to_previous = lambda x: x['category']==x['category'].shift(1))
    .loc[lambda x: ~x['equal_to_previous']]
    .drop(columns=['equal_to_previous'])
)

Or, more concisely:

df.loc[lambda x: x['category']!=x['category'].shift(1)]

The trick is to keep only the rows whose category differs from the previous row's category.
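
To make the shift comparison easier to follow, here is a minimal sketch on fixed data. The hard-coded column values and the names `starts_run`, `first_of_each_run`, `run_id` and `run_lengths` are assumptions for illustration; the question builds its frame with random.sample instead.

```python
import pandas as pd

# Deterministic stand-in for the question's randomly generated frame (assumed values).
df = pd.DataFrame({"category": ["A", "A", "B", "B", "A", "A"]})

# True wherever a row differs from the row above it. The first row is compared
# against the NaN that shift() introduces, so it always counts as a run start.
starts_run = df["category"] != df["category"].shift(1)

# Keep only the first row of each consecutive run; the original index is preserved.
first_of_each_run = df.loc[starts_run]
print(first_of_each_run)

# The cumulative sum of the mask numbers the runs (1, 1, 2, 2, 3, 3), which can
# serve as a groupby key when more than the first row of each run is needed.
run_id = starts_run.cumsum()
run_lengths = df.groupby(run_id).size()
print(run_lengths)
```

One caveat: NaN never compares equal to NaN, so consecutive missing values would each be treated as the start of a new run with this approach.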
