Find, count and extract duplications in dataframes in python

Question

I have a dataframe includes two columns [Color, Value] as shown in the figure below:

df1 = pd.DataFrame({"Color":[Green, Blue, Green, Green, Blue, Red, Green, Green, Blue, Red, Green, Green, Blue, Red, Blue, Blue, Green], 
                "Value":[20, 21, 25, 30, 41, 512, 40, 41, 352, 31, 52, 451, 253, 54, 142, 122, 784 ]})

The aim is to find a list of duplications of color's column and count them. I used this code:

dups_colors = pd.concat(g for _, g in data.groupby("Color") if len(g) > 1)#find duplications

The output is:

Then, I counted the duplications using this code:

count = dups_colors.pivot_table(index=['Color'], aggfunc='size')#count number of duplications

The output is:

Until here everything is fine. Now I have two questions as follows:

Q1: I need to add the count column into the dups_colors dataframe in this mannar.

Q2: I need to extract the values correspond to each color into a sperate column with header of the color, same like this:

Please have a look and help me.

I'd be happy to help, but I won't type in the pictures. Please provide a small set of sample data as text that we can copy and paste. Include the corresponding desired result. Check out the guide on [how to make good reproducible pandas examples](https://stackoverflow.com/a/20159305/3620003). — timgeb, Jun 25 '20 at 12:35
Question 2 is pivot table. Search SO for *How to pivot a dataframe* — Quang Hoang, Jun 25 '20 at 12:36
@timgeb I updated the question such that you can use the test sample of data. Could you help? — Mohsen Ali, Jun 25 '20 at 12:52

score 1 · Accepted Answer · answered Jun 25 '20 at 12:53

Regarding Q1 I would do:

import pandas as pd
df = pd.DataFrame({'color':['Green','Green','Green','Blue','Blue','Red']})
count = {'Green':3, 'Blue':2}
df['Count'] = None
for name, occur in df.groupby('color').indices.items():
    df['Count'][occur[0]] = count.get(name)
print(df)

Output:

  color Count
0  Green     3
1  Green  None
2  Green  None
3   Blue     2
4   Blue  None
5    Red  None

Note: I simplified your example and decided to use None for all non-first occurences of given color and single colors.

Your answer seems logical, but when tried it I got all Count column values are None!!. I got this Warning: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame — Mohsen Ali, Jun 25 '20 at 13:26

cyau · Answer 2 · 2020-06-25T13:27:37.383

If you want to count duplicates, I would suggest using pandas.DataFrame.duplicated():

df['dupl'] = df.duplicated('Color', False)

Where the False parameter counts all duplicated values as per the documentation here.

If you want to add a new column showing how many duplicates that row has, you can use:

df['duplicated_count'] = df.groupby('Color').dupl.transform('sum')

At last, you can reshape the dataframe as follows, although this is not quite what you expect:

df.pivot(columns='Color', values='Value').fillna(0)

I should say that all columns in a data frame should have the same length, so the result in Q2 is not achievable as it looks in Excel.

Find, count and extract duplications in dataframes in python

2 Answers2