With the following DataFrame:

import pandas as pd
import random
random.seed(3)
df = pd.DataFrame( 
  data=[random.sample(["A","B"],1) for i in range(6)],
  columns=["category"] )

We get:

  category
0  A
1  A
2  B
3  B
4  A
5  A

How do I get only the first row for each consecutive category group?

Note: the data can contain an arbitrary number of repeats - I only want the first of each consecutive group.

Expected would be:

  category
0  A
2  B
4  A

I hoped that the sort flag of groupby() would solve this, but it still treats all occurrences of a category as one group, not just the consecutive ones:

df.groupby("category").head(1)

  category
0  A
2  B
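
To make the behaviour concrete, here is a minimal sketch. It assumes a fixed, hypothetical column (A, A, B, B, A, A) in place of the random sample, so the result is reproducible. As far as I understand it, sort=False only affects the order of the group keys, not which rows end up in the same group:

```python
import pandas as pd

# Hypothetical fixed data standing in for the random sample above.
df = pd.DataFrame({"category": ["A", "A", "B", "B", "A", "A"]})

# sort=False only stops groupby from sorting the group keys;
# rows are still grouped by value, wherever they occur in the column.
by_value = df.groupby("category", sort=False).head(1)
print(by_value)
```

The second run of A (starting at index 4) is merged into the first group, so it contributes no row of its own.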

As I am learning pandas and my DataFrame can become very large, I'm looking for a pandas-native solution rather than iterating over the array or DataFrame.

While the answers from "Make Pandas groupby act similarly to itertools groupby" can be applied here, the question posed there is different. I would therefore leave this question open so that an answer is easier to find.

Pascal

1 Answer

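This is probably the best solution: flag every row whose category equals the previous row's, keep the rest, and drop the helper column: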

(df
    .assign(equal_to_previous = lambda x: x['category']==x['category'].shift(1))
    .loc[lambda x: ~x['equal_to_previous']]
    .drop(columns=['equal_to_previous'])
)

Or, more concisely:

df.loc[lambda x: x['category']!=x['category'].shift(1)]

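The trick is to keep only the rows whose category differs from the previous row's. The first row is always kept, because shift(1) puts NaN there and a comparison with NaN is never equal.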
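
A step-by-step sketch of the same idea, assuming a fixed, hypothetical column (A, A, B, B, A, A) instead of the random sample so the result is reproducible. The names previous, starts_new_run and run_id are only illustrative. The last part is a different technique, shown as a cross-check: the cumulative sum of the mask numbers each consecutive run, which lets groupby() treat runs rather than values as groups.

```python
import pandas as pd

# Hypothetical fixed data (not the random sample from the question),
# chosen so the consecutive runs are easy to see.
df = pd.DataFrame({"category": ["A", "A", "B", "B", "A", "A"]})

# shift(1) moves every value down one row; row 0 gets NaN.
previous = df["category"].shift(1)

# True where a row starts a new run. Row 0 is compared against NaN,
# which never compares equal, so the first row is always kept.
starts_new_run = df["category"] != previous

first_of_each_run = df.loc[starts_new_run]
print(first_of_each_run)

# Alternative (a different technique): number the runs with cumsum()
# and let groupby() pick the first row of each run.
run_id = starts_new_run.cumsum()
via_groupby = df.groupby(run_id).head(1)
print(via_groupby)
```

The boolean mask is enough when only the first row of each run is needed. The run_id variant is more useful when something has to be computed per run, for example its length.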

MYK