1

I am new to Pandas.

Consider this my DataFrame:

df

Search              Impressions     Clicks      Transactions    ContainsBest       ContainsFree         Country
Best phone          10              5           1               True               False                UK
Best free phone     15              4           2               True               True                 UK
free phone          20              3           4               False              True                 UK
good phone          13              1           5               False              False                US
just a free phone   12              3           4               False              True                 US

I have columns ContainsBest and ContainsFree. I want to sum all Impressions, Clicks and Transactions where ContainsBest is True, then I want to sum Impressions, Clicks and Transactions where ContainsFree is True and do the same per unique value in column Country. So the new DataFrame would look like this:

output_df

Country             Impressions     Clicks      Transactions
UK                  45              12          7
ContainsBest        25              9           3
ContainsFree        35              7           6

US                  25              4           9
ContainsBest        0               0           0
ContainsFree        12              3           4

For this I would understand I would need to use something like the following:

uk_toal_impressions = df['Impressions'].sum().where(df['Country']=='UK')

uk_best_impressions = df['Impressions'].sum().where(df['Country']=='UK' & df['ContainsBest'])

uk_free_impressions = df['Impressions'].sum().where(df['Country']=='UK' & df['ContainsFree'])

Then I would apply the same logic for Clicks and Transactions and redo the same code for Country US.

The second thing I am trying to achieve is to add column TopCategories per Country and Impressions, Clicks and Transactions, so that my final_output_df would look like this:

final_output_df

Country             Impressions     Clicks      Transactions        TopCategoriesForImpressions     TopCategoriesForClicks          TopCategoriesForTransactions     
UK                  45              12          7                   ContainsFree                    ContainsBest                    ContainsFree
ContainsBest        25              9           3                   ContainsBest                    ContainsFree                    ContainsBest
ContainsFree        35              7           6

US                  25              4           9                   ContainsFree                    ContainsFree                    ContainsFree
ContainsBest        0               0           0
ContainsFree        12              3           4

The column TopCategoriesForxx logic is a simple sort of ContainsBest and ContainsFree rows under Country column. So that the TopCategoriesForImpressions for UK country is

  1. ContainsFree
  2. ContainsBest

While the TopCategoriesForClicks for UK country is:

  1. ContainsBest
  2. ContainsFree

I understand I would need to use something like this:

TopCategoriesForImpressions = output_df['Impressions'].sort_values(by='Impressions', ascending=False).where(output_df['Country']=='UK')

I just find difficult to put everything to look as in my last final_output_df. Also, I would assume I do not need output_df to be created, just wanted to add it for better understanding of what my steps are to achieve the final_output_df.

So my questions are:

  1. How to apply calculations based on one and multiple conditions? See row ContainsBest and ContainsFree
  2. How can I sort column values based on a condition? See column TopCategoriesForImpressions
  3. In reality, I have 70 countries and 20 columns Containsxxx, is there a way to achieve this without adding conditions for 70 countries and 20 Containsxxx columns?

Thank you very much for your suggestions.

Jonas Palačionis
  • 4,591
  • 4
  • 22
  • 55

1 Answers1

1

First part of solution should be:

#removed unnecessary column Search and added ContainAll column filled Trues
df1 = df.drop('Search', 1).assign(ContainAll = True)

#columns for tests
cols1 = ['Impressions','Clicks','Transactions']
cols2 = ['ContainsBest','ContainsFree','ContainAll']

print (df1[cols2].dtypes)
ContainsBest    bool
ContainsFree    bool
ContainAll      bool
dtype: object

print (df1[cols1].dtypes)
Impressions     int64
Clicks          int64
Transactions    int64
dtype: object

print (df1.melt(['Country'] + cols1, var_name='Type', value_name='mask'))
   Country  Impressions  Clicks  Transactions          Type   mask
0       UK           10       5             1  ContainsBest   True
1       UK           15       4             2  ContainsBest   True
2       UK           20       3             4  ContainsBest  False
3       US           13       1             5  ContainsBest  False
4       US           12       3             4  ContainsBest  False
5       UK           10       5             1  ContainsFree  False
6       UK           15       4             2  ContainsFree   True
7       UK           20       3             4  ContainsFree   True
8       US           13       1             5  ContainsFree  False
9       US           12       3             4  ContainsFree   True
10      UK           10       5             1    ContainAll   True
11      UK           15       4             2    ContainAll   True
12      UK           20       3             4    ContainAll   True
13      US           13       1             5    ContainAll   True
14      US           12       3             4    ContainAll   True

print (df1.melt(['Country'] + cols1, var_name='Type', value_name='mask').query('mask'))
   Country  Impressions  Clicks  Transactions          Type  mask
0       UK           10       5             1  ContainsBest  True
1       UK           15       4             2  ContainsBest  True
6       UK           15       4             2  ContainsFree  True
7       UK           20       3             4  ContainsFree  True
9       US           12       3             4  ContainsFree  True
10      UK           10       5             1    ContainAll  True
11      UK           15       4             2    ContainAll  True
12      UK           20       3             4    ContainAll  True
13      US           13       1             5    ContainAll  True
14      US           12       3             4    ContainAll  True

#all possible combinations of Country and boolean columns
mux = pd.MultiIndex.from_product([df['Country'].unique(), cols2], 
                                  names=['Country','Type'])

#reshape by melt for all boolean column to one mask column
#filter Trues by loc and aggregate sum
#add 0 rows by reindex
df1 = (df1.melt(['Country'] + cols1, var_name='Type', value_name='mask')
          .query('mask')
          .drop('mask', axis=1)
          .groupby(['Country','Type'])
          .sum()
          .reindex(mux, fill_value=0)
          .reset_index())
print (df1)
  Country          Type  Impressions  Clicks  Transactions
0      UK  ContainsBest           25       9             3
1      UK  ContainsFree           35       7             6
2      UK    ContainAll           45      12             7
3      US  ContainsBest            0       0             0
4      US  ContainsFree           12       3             4
5      US    ContainAll           25       4             9

For second is possible filter rows for check sortings with numpy.argsort with descending order per groups:

def f(x):
    i = x.index.to_numpy()
    a = i[(-x.to_numpy()).argsort(axis=0)]
    return pd.DataFrame(a, columns=x.columns)


df2 = (df1[df1['Type'].isin(['ContainsBest','ContainsFree']) &
          ~df1[cols1].eq(0).all(1)]
           .set_index('Type')
           .groupby('Country')[cols1]
           .apply(f)
           .add_prefix('TopCategoriesFor')
           .rename_axis(['Country','Type'])
           .rename({0:'ContainsBest', 1:'ContainsFree'})
)
print (df2)
                     TopCategoriesForImpressions TopCategoriesForClicks  \
Country Type                                                              
UK      ContainsBest                ContainsFree           ContainsBest   
        ContainsFree                ContainsBest           ContainsFree   
US      ContainsBest                ContainsFree           ContainsFree   

                     TopCategoriesForTransactions  
Country Type                                       
UK      ContainsBest                 ContainsFree  
        ContainsFree                 ContainsBest  
US      ContainsBest                 ContainsFree  

df3 = df1.join(df2, on=['Country','Type'])
print (df3)
  Country          Type  Impressions  Clicks  Transactions  \
0      UK  ContainsBest           25       9             3   
1      UK  ContainsFree           35       7             6   
2      UK    ContainAll           45      12             7   
3      US  ContainsBest            0       0             0   
4      US  ContainsFree           12       3             4   
5      US    ContainAll           25       4             9   

  TopCategoriesForImpressions TopCategoriesForClicks  \
0                ContainsFree           ContainsBest   
1                ContainsBest           ContainsFree   
2                         NaN                    NaN   
3                ContainsFree           ContainsFree   
4                         NaN                    NaN   
5                         NaN                    NaN   

  TopCategoriesForTransactions  
0                 ContainsFree  
1                 ContainsBest  
2                          NaN  
3                 ContainsFree  
4                          NaN  
5                          NaN  
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • for some reason I do get the same structure as you have but all of my values are `0`. Any idea why would that be the case? I am writing the `df` to a csv before using the code you have added, maybe `to_csv` makes the `df` empty? – Jonas Palačionis Jan 10 '20 at 14:18
  • @JonasPalačionis - hmmm, are data numeric? – jezrael Jan 10 '20 at 14:19
  • checking that right now. I also do get a warning `FutureWarning: Passing list-likes to .loc or [] with any missing label will raise KeyError in the future, you can use .reindex() as an alternative.` Not sure if that can affect the script. – Jonas Palačionis Jan 10 '20 at 14:22
  • @JonasPalačionis - One idea what is `print (df1.melt(['Country'] + cols1, var_name='Type', value_name='mask').dtypes)` ? – jezrael Jan 10 '20 at 14:24
  • yes, `Impressions`, `Transactions` and `Clicks` are type `int64` while `Country` is an `object` – Jonas Palačionis Jan 10 '20 at 14:24
  • @JonasPalačionis - is column `mask` boolean? – jezrael Jan 10 '20 at 14:25
  • @JonasPalačionis - I can add some `print`s for check where is problem, give me some time. – jezrael Jan 10 '20 at 14:31
  • answer to your question - the type is `Country object Impressions float64 Clicks float64 Transactions float64 Type object mask object dtype: object`, I know checked and there are one value in `Impressions` of `8` and 3 values of `1` in `Clicks` column. – Jonas Palačionis Jan 10 '20 at 14:36
  • @JonasPalačionis - there is problem, there should be boolean in mask column – jezrael Jan 10 '20 at 14:37
  • @JonasPalačionis - If check `print (df1[cols2].dtype)` after first row of code `df1 = df.drop('Search', 1).assign(ContainAll = True)` return all `boolean` ? – jezrael Jan 10 '20 at 14:38
  • yes, all return `bool`, `ContainsBest bool ContainsFree bool ContainAll bool` – Jonas Palačionis Jan 10 '20 at 14:46
  • 1
    @JonasPalačionis - Added some prints to answer for check it. – jezrael Jan 10 '20 at 14:52
  • looks like it works, previously I have added more `Containsxxx` columns, not sure why it broke the code. In reality, I have like 15 `Containsxxx` fields which I add to your defined `cols2` list. Will try adding one by one to check where it breaks. Or does there other changes that need to be made in order to include more items, not only in `col2`? – Jonas Palačionis Jan 10 '20 at 16:28
  • managed to put everything in place with my 15 `contains` columns and sort them. Thank you very much for your help, just one thing I cant quite understand, how does the sorting work for each of 3 `contains` columns. I see you have a defined function but do not invoke it anywhere. Thats some high level code, thank you so much sir! – Jonas Palačionis Jan 10 '20 at 17:30