Pandas: If two or more rows have same values in particular column then get its count and add to next row

Question

Sort data frame by values of 5th column ["Card"] and add a new row after each card nos with its count. After sorting values how can I add a new row with Total:

Dataframe looks something like this

This is how I want output data frame

please do not use images of data, rather provide a minimal text example — mozway, Jul 12 '22 at 12:00

Michael S. · Accepted Answer · 2022-07-18T13:01:19.597

You can give this a try:

import pandas as pd

# create dummy df
card = ["2222","2222","1111","2222","1111","3333"]
name = ["Ed", "Ed",  "John", "Ed", "John", "Kevin"]
phone = ["1##-###-####", "1##-###-####", "2##-###-####", "1##-###-####", "2##-###-####", "3##-###-####"]
df = pd.DataFrame({"Name":name, "Phone":phone, "Card":card})

# sort by Card value
df = df.sort_values(by=["Card"]).reset_index(drop=True)

# Groupby the Card value, count them, then insert a new row based on that count
index = 0
line = []
for x in df.groupby("Card").size():
    index += x
    line.append(pd.DataFrame({"Name": "", "Phone":"", "Card": str(x)}, index=[index]))
df = df.append(line, ignore_index=False)
df = df.sort_values(by=["Card"]).sort_index().reset_index(drop=True)
df

Output:

    Name    Phone           Card
0   Ed      1##-###-####    1111
1   Ed      1##-###-####    1111
2   Ed      1##-###-####    1111
3                              3
4   John    2##-###-####    2222
5   John    2##-###-####    2222
6                              2
7   Kevin   3##-###-####    3333
8                              1

Edit ~~~~

Due to OP's use of string for card numbers, an edit had to be made to account for naturally sorting string ints

import pandas as pd
from natsort import natsort_keygen ##### Now needed because OP has Card numbers as strings

# create dummy df ##############
card = ["1111", "2222", "3333", "4444", "5555", "6666", "7777", "8888"]
name = ["Ed", "John", "Jake", "Mike", "Liz", "Anne", "Deb", "Steph"]
phone = ["1###", "2###", "3###", "4###", "5###", "6###", "7###", "8###"]
dfList = [a for a in zip(name, phone, card)]
dfList = [dfList[random.randrange(len(dfList))] for i in range(50)]
df = pd.DataFrame(dfList, columns=["Name", "Phone", "Card"])
################################

# sort by Card value
df = df.sort_values(by=["Card"]).reset_index(drop=True)

# Groupby the Card value, count them, then insert a new row based on that count
index = 0
line = []
for x in df.groupby("Card").size():
    index += x
    line.append(pd.DataFrame({"Name": "", "Phone":"", "Card": str(x)}, index=[index-1]))
df = pd.concat([df, pd.concat(line)], ignore_index=False)

# Create an Index column to be used in the by pandas sort_values
df["Index"] = df.index

# Sort the values first by index then by card number, use "natsort_keygen()" to naturally sort ints that are strings
df = df.sort_values(by = ['Index', 'Card'], key=natsort_keygen(), ascending = [True, False]).reset_index(drop=True).drop(["Index"], axis=1)

It is advisable to never pd.concat nor pd.Dataframe.append inside of a loop, it leads to [quadratic copying](https://stackoverflow.com/a/36489724/6361531) hence the bigger the dataframe is the slower the operation becomes expontentially. Instead create an empty list, and append dataframes to that list after looping then you pd.concat that list of dataframes. — Scott Boston, Jul 12 '22 at 12:49
Thank you @[Scott Boston](https://stackoverflow.com/users/6361531/scott-boston), I've updated my answer with an empty list — Michael S., Jul 12 '22 at 13:05
@MichaelS. For short number of rows it's working properly but for multiple rows it is not properly grouping as per requirement — Adataguy, Jul 18 '22 at 06:07
@Ajinkz Check the edit, it was failing because your Card values are strings, not ints, and thus wasn't naturally sorting them. The edit now accounts for that (you'll have to install natsort) — Michael S., Jul 18 '22 at 13:02

score 0 · Answer 2 · answered Jul 20 '22 at 13:38

I'm not sure if this is the best way but it worked for me.


list_of_df = [v for k, v in df.groupby("card")]
for i in enumerate(list_of_df):
    list_of_df[i[0]] = list_of_df[i[0]].append({"card":str(list_of_df[i[0]].shape[0])}, ignore_index=True)

final_df = pd.concat(list_of_df)

Pandas: If two or more rows have same values in particular column then get its count and add to next row

2 Answers2

Edit ~~~~