0

I have a dataframe with the following structure:

df= pd.DataFrame({"Origin": ['A','A','A','A','B','B','B'],"Name":["John","Bob","John","John","Bob","Elvis","John"],"Year":[2015,2015,2015,2015,2015,2016,2016]})

Origin |Name | Year
    A   John    2015
    A   Bob     2015
    A   John    2015
    A   John    2015
    B   Bob     2015
    B   Elvis   2016
    B   John    2016

I am trying to regroup the data as follow : for each year, I want to count the number of line per name per origin. I present it with the origin and year joint in a column name and the name in index. The expected output here should be as :

index  | A_2015 | A_2016 | B_2015 | B 2016 |
--------------------------------------------
John   | 3      | 0      | 0      |  1
Bob    | 1      | 0      | 1      |  0
Elvis  | 0      | 0      | 0      |  1

The origin can be separated in different dataframe it doesn't realy matter to me (my current code is dividing it).
My code is running but if find it non elegant as I combine a for loop and a group_by. Here is my shot, I was wondering if there was a better way to perform this transformation.

dfTotalA = pd.DataFrame()
dfTotalB = pd.DataFrame()
for Year in TotalData.Year.unique():
    df = TotalData.query("(Origin == 'A') & (YearHit == @Year)")[["Origin","Name"]].groupby('Name').Origin.size().to_frame(name='A_'+str(Year))
    dfTotalA= pd.concat([dfTotal,df1],sort=False,axis=1)
    df2 = TotalData.query("(Origin == 'B') & (YearHit == @Year)")[["Origin","Name"]].groupby('Name').Origin.size().to_frame(name='B_'+str(Year))
    dfTotalB= pd.concat([dftotalB,df2],sort=False,axis=1)
Completedf = pd.concat(dfTotalA,dftotalB],sort=False,axis=1)
Completedf.fillna(0,inplace=True)
Mayeul sgc
  • 1,964
  • 3
  • 20
  • 35

2 Answers2

1

One way to do so:

new_df = (df.groupby(['Origin', 'Year'])
   .Name
   .value_counts()
   .unstack(['Origin'], fill_value=0)
   .unstack('Year', fill_value=0)
)

which gives:

Origin    A         B     
Year   2015 2016 2015 2016
Name                      
Bob       1    0    1    0
Elvis     0    0    0    1
John      3    0    0    1

And then you can rename the columns with:

new_df.columns = [f'{a}_{b}' for a,b in new_df.columns]
Quang Hoang
  • 146,074
  • 10
  • 56
  • 74
1

We can create a count column with GroupBy.transform. Then pivot_table. And finally flatten our MultiIndex columns:

df['cnt'] = df.groupby(['Origin', 'Name'])['Year'].transform('count')

piv = df.pivot_table(index='Name', columns=['Origin', 'Year'], fill_value=0)
piv.columns = [f'{c[1]}_{c[2]}'for c in piv.columns]

Output

       A_2015  B_2015  B_2016
Name                         
Bob         1       1       0
Elvis       0       0       1
John        3       0       1
Erfan
  • 40,971
  • 8
  • 66
  • 78