Getting rows for increasing values of a column in each group in grouped dataframe

Question

My dataframe df is:

data = {'Election Year':['2000', '2000','2000','2000','2000','2000','2000','2000','2000','2005','2005','2005','2005','2005','2005','2005','2005','2005', '2010', '2010','2010','2010','2010','2010','2010','2010', '2010'],
    'Votes':[30, 50, 20, 26, 30, 45, 20, 46, 80, 60, 46, 95, 60, 10, 95, 16, 65, 35, 50, 100, 70, 26, 180, 100, 120, 46, 80], 
    'Party': ['A', 'B', 'C', 'A', 'B', 'C','A', 'B', 'C','A', 'B', 'C','A', 'B', 'C','A', 'B', 'C', 'A', 'B', 'C','A', 'B', 'C','A', 'B', 'C'],
    'Region': ['a', 'a', 'a', 'b', 'b', 'b','c', 'c', 'c','a', 'a', 'a', 'b', 'b', 'b','c', 'c', 'c','a', 'a', 'a', 'b', 'b', 'b','c', 'c', 'c']}
df = pd.DataFrame(data)
df

    
    Election Year   Votes   Party   Region
  0   2000           30      A       a
  1   2000           50      B       a
  2   2000           20      C       a
  3   2000           26      A       b
  4   2000           30      B       b
  5   2000           45      C       b 
  6   2000           20      A       c
  7   2000           46      B       c
  8   2000           80      C       c
  9   2005           60      A       a
  10  2005           66      B       a
  11  2005           95      C       a
  12  2005           60      A       b
  13  2005           10      B       b
  14  2005           95      C       b
  15  2005           16      A       c
  16  2005           65      B       c
  17  2005           35      C       c
  18  2010           50      A       a
  19  2010           100     B       a
  20  2010           70      C       a
  21  2010           26      A       b
  22  2010           180     B       b
  23  2010           100     C       b 
  24  2010           120     A       c
  25  2010           46      B       c
  26  2010           80      C       c

I want to the Regions in which top two parties (in terms of getting maximum aggregate votes) of 2000 have continuously increased their votes in the next 2 elections.. So the desired output is:

   Party   Region
     B        a
     B        c
     C        b

First I tried to get the top two parties based on aggregate total votes of year 2000. This is giving Party "C" and "B".

df1=df['Election Year'].eq('2000')
top_2=df[m].groupby(['Election Year','Party'],as_index=False) 
       ['Votes'].sum().sort_values('Votes',ascending=False).head(2)['Party'].values
top_2

This gives Party "C" and "B".

Now how can I check for the regions in which Votes of these parties have increased in subsequent years?

with my approach you will not able to get Region `c` because it's votes is less than that of year 2000 — Anurag Dabas, Jun 11 '21 at 15:45
join this discussion https://chat.stackoverflow.com/rooms/233666/https-stackoverflow-com-questions-67937940-getting-rows-for-increasing-values-o — Anurag Dabas, Jun 11 '21 at 15:50

Rob Raymond · Answer 1 · 2021-06-11T16:02:29.490

first establish two parties with highest votes in 2010
then for future years, analyse party/region combinations for ascending votes and total votes
finally pick out party/region pairings that meet criteria
this logic from what I can see works but gives a different output to yours

data = {'Election Year':['2000', '2000','2000','2000','2000','2000','2000','2000','2000','2005','2005','2005','2005','2005','2005','2005','2005','2005', '2010', '2010','2010','2010','2010','2010','2010','2010', '2010'],
    'Votes':[30, 50, 20, 26, 30, 45, 20, 46, 80, 60, 46, 95, 60, 10, 95, 16, 65, 35, 50, 100, 70, 26, 180, 100, 120, 46, 80], 
    'Party': ['A', 'B', 'C', 'A', 'B', 'C','A', 'B', 'C','A', 'B', 'C','A', 'B', 'C','A', 'B', 'C', 'A', 'B', 'C','A', 'B', 'C','A', 'B', 'C'],
    'Region': ['a', 'a', 'a', 'b', 'b', 'b','c', 'c', 'c','a', 'a', 'a', 'b', 'b', 'b','c', 'c', 'c','a', 'a', 'a', 'b', 'b', 'b','c', 'c', 'c']}
df = pd.DataFrame(data)
# get top parties in 2000
topp = (
    df.loc[df["Election Year"].eq("2000")].groupby("Party").agg({"Votes": "sum"})
    .sort_values("Votes", ascending=False)
    .head(2)
)


def f(df):
    # test that year by year the votes are increasing.  also provide total votes for final step
    return pd.Series(
        {
            "ascending": (~df["Votes"].lt(df["Votes"].shift())).all(),
            "Votes": df["Votes"].sum(),
        }
    )


df2 = (
    df.loc[df["Party"].isin(topp.index) &
           df["Election Year"].gt("2000")].groupby(
               ["Party", "Region"], as_index=False
           ).apply(f)
)

# final step - filter those with ascending votes and pick out part/region that have most votes
df2.loc[df2["ascending"]].sort_values(["Region", "Votes"], ascending=[1, 0]).groupby("Region", as_index=False).first()

output

	Region	Party	ascending	Votes
0	a	B	True	146
1	b	C	True	195
2	c	C	True	115

open the logic

the complex step is finding Party and Region combinations where votes are increasing
this is an alternate approach that provides more transparency. It uses named aggregations instead of a apply() plus outputs more of the data that is being considered

df2 = (
    df.loc[df["Party"].isin(topp.index) & df["Election Year"].gt("2000")]
    .groupby(["Party", "Region"], as_index=False)
    .agg(
        VoteDtl=("Votes", list),
        Votes=("Votes", "sum"),
        VotesS=("Votes", lambda s: s.shift(-1).fillna(10 ** 6).tolist()),
        Asc=("Votes", lambda s: s.shift(-1).fillna(10 ** 6).gt(s).all()),
    )
)

df2

	Party	Region	VoteDtl	Votes	VotesS	Asc
0	B	a	[46, 100]	146	[100.0, 1000000.0]	True
1	B	b	[10, 180]	190	[180.0, 1000000.0]	True
2	B	c	[65, 46]	111	[46.0, 1000000.0]	False
3	C	a	[95, 70]	165	[70.0, 1000000.0]	False
4	C	b	[95, 100]	195	[100.0, 1000000.0]	True
5	C	c	[35, 80]	115	[80.0, 1000000.0]	True

Thanks for your time and answer. Could it be solved in a simpler way? I am getting overwhelmed by this. — Dpk, Jun 11 '21 at 15:15
I've updated to provide more transparency on groupby / aggregate phase to find combinations where votes are increasing — Rob Raymond, Jun 11 '21 at 16:04

tomjn · Answer 2 · 2021-06-12T19:39:22.657

I like the answer of Rob Raymond, but I just want to highlight a couple more things that might be useful in solving this kind of problem. For this kind of analysis it is nice to look at the data with the "Party" and "Region" columns as indices.

If we do

grps = ["Party", "Region"]  # I do this because we will use these later
df = df.set_index(grps).sort_index()

Then df now looks like

             Election Year  Votes
Party Region
A     a               2000     30
      a               2005     60
      a               2010     50
      b               2000     26
      b               2005     60
      b               2010     26
      c               2000     20
      c               2005     16
      c               2010    120
B     a               2000     50
      a               2005     46
      a               2010    100
      b               2000     30
      b               2005     10
      b               2010    180
      c               2000     46
      c               2005     65
      c               2010     46
C     a               2000     20
      a               2005     95
      a               2010     70
      b               2000     45
      b               2005     95
      b               2010    100
      c               2000     80
      c               2005     35
      c               2010     80

I think this is easier to study by eye for some cross-checking e.g according to my understanding of your question, the only party/region where the number of votes increases each year is "C/b" where there were 45 votes in 2000, 95 votes in 2005 and 100 votes in 2010.

But what if the data is too large to look at. Well then we can group by the new index (remember this is now "Party" AND "Region") and apply the diff method to the "Votes" column. We'll assign the result of this back to a new column called "Vote Diff".

df["Vote Diff"] = df.groupby(grps)["Votes"].diff()

Now df is

             Election Year  Votes  Vote Diff
Party Region
A     a               2000     30        NaN
      a               2005     60       30.0
      a               2010     50      -10.0
      b               2000     26        NaN
      b               2005     60       34.0
      b               2010     26      -34.0
      c               2000     20        NaN
      c               2005     16       -4.0
      c               2010    120      104.0
B     a               2000     50        NaN
      a               2005     46       -4.0
      a               2010    100       54.0
      b               2000     30        NaN
      b               2005     10      -20.0
      b               2010    180      170.0
      c               2000     46        NaN
      c               2005     65       19.0
      c               2010     46      -19.0
C     a               2000     20        NaN
      a               2005     95       75.0
      a               2010     70      -25.0
      b               2000     45        NaN
      b               2005     95       50.0
      b               2010    100        5.0
      c               2000     80        NaN
      c               2005     35      -45.0
      c               2010     80       45.0

Now we can easily see when the vote rose and fell. For what you want to do we don't care too much about the year 2000 for now so we can safely drop rows with a NaN using drona for the next part.

We now need to filter groups of Party/Region, only keeping those where all values of "Vote diff" in a group are positive (i.e. the vote increased each year). We can do that using filter on the groupby object. We need a small function to test whether this is true, here I use a lambda but you could also just define it.

out = df.dropna().groupby(grps).filter(lambda x: (x["Vote Diff"] > 0).all())

Gives

             Election Year  Votes  Vote Diff
Party Region
C     b               2005     95       50.0
      b               2010    100        5.0

Voila! We see that Party C/Region b is the only one with a year-on-year increase in votes.

We've not looked at how to combine this with your top parties requirement, but ones you have a list of top parties it is straight forward (slight modification of your top_2 code)

top_2 = (df[df["Election Year"] == "2000"]
           .groupby("Party")["Votes"].sum()
           .nlargest(2))
out.loc[top_2.index]

Putting it all together

top_2 = (df[df["Election Year"] == "2000"]
           .groupby("Party")["Votes"].sum()
           .nlargest(2))

grps = ["Party", "Region"]
df = df.set_index(grps).sort_index()
df["Vote Diff"] = df.groupby(grps)["Votes"].diff()
df.dropna().groupby(grps).filter(lambda x: (x["Vote Diff"] > 0).all()).loc[top_2.index]

Getting rows for increasing values of a column in each group in grouped dataframe

2 Answers2

output

open the logic

Putting it all together

Linked