I am new to Pandas.
Consider this my DataFrame:
df
Search Impressions Clicks Transactions ContainsBest ContainsFree Country
Best phone 10 5 1 True False UK
Best free phone 15 4 2 True True UK
free phone 20 3 4 False True UK
good phone 13 1 5 False False US
just a free phone 12 3 4 False True US
I have columns ContainsBest
and ContainsFree
. I want to sum all Impressions
, Clicks
and Transactions
where ContainsBest
is True
, then I want to sum Impressions
, Clicks
and Transactions
where ContainsFree
is True and do the same per unique value in column Country
. So the new DataFrame would look like this:
output_df
Country Impressions Clicks Transactions
UK 45 12 7
ContainsBest 25 9 3
ContainsFree 35 7 6
US 25 4 9
ContainsBest 0 0 0
ContainsFree 12 3 4
For this I would understand I would need to use something like the following:
uk_toal_impressions = df['Impressions'].sum().where(df['Country']=='UK')
uk_best_impressions = df['Impressions'].sum().where(df['Country']=='UK' & df['ContainsBest'])
uk_free_impressions = df['Impressions'].sum().where(df['Country']=='UK' & df['ContainsFree'])
Then I would apply the same logic for Clicks
and Transactions
and redo the same code for Country
US
.
The second thing I am trying to achieve is to add column TopCategories
per Country
and Impressions
, Clicks
and Transactions
, so that my final_output_df
would look like this:
final_output_df
Country Impressions Clicks Transactions TopCategoriesForImpressions TopCategoriesForClicks TopCategoriesForTransactions
UK 45 12 7 ContainsFree ContainsBest ContainsFree
ContainsBest 25 9 3 ContainsBest ContainsFree ContainsBest
ContainsFree 35 7 6
US 25 4 9 ContainsFree ContainsFree ContainsFree
ContainsBest 0 0 0
ContainsFree 12 3 4
The column TopCategoriesForxx
logic is a simple sort of ContainsBest
and ContainsFree
rows under Country
column. So that the TopCategoriesForImpressions
for UK
country is
- ContainsFree
- ContainsBest
While the TopCategoriesForClicks
for UK
country is:
- ContainsBest
- ContainsFree
I understand I would need to use something like this:
TopCategoriesForImpressions = output_df['Impressions'].sort_values(by='Impressions', ascending=False).where(output_df['Country']=='UK')
I just find difficult to put everything to look as in my last final_output_df
. Also, I would assume I do not need output_df
to be created, just wanted to add it for better understanding of what my steps are to achieve the final_output_df
.
So my questions are:
- How to apply calculations based on one and multiple conditions? See row
ContainsBest
andContainsFree
- How can I sort column values based on a condition? See column
TopCategoriesForImpressions
- In reality, I have 70 countries and 20 columns
Containsxxx
, is there a way to achieve this without adding conditions for 70 countries and 20Containsxxx
columns?
Thank you very much for your suggestions.