0

I have the following (toy) dataset, which looks similar to the production data:

import pandas as pd
import numpy as np

df = pd.DataFrame({'System_Key':['MER-002', 'MER-003', 'MER-004', 'MER-005', 'BAV-378', 'BAV-379', 'BAV-380', 'BAV-381', 'AUD-220', 'AUD-221', 'AUD-222', 'AUD-223'],
                   'Manufacturer':['Mercedes', 'Mercedes', 'Mercedes', 'Mercedes', 'BMW', 'BMW', 'BMW', 'BMW', 'Audi', 'Audi', 'Audi', 'Audi'],
                   'Region':['Americas', 'Europe', 'Americas', 'Asia', 'Asia', 'Europe', 'Europe', 'Europe', 'Americas', 'Asia', 'Americas', 'Americas'],
                   'Department':[np.nan, 'Sales', np.nan, 'Operations', np.nan, np.nan, 'Accounting', np.nan, 'Finance', 'Finance', 'Finance', np.nan],
                   'Approver':[np.nan, 'Jones, T.', np.nan, 'Jones, T.', 'Jones, T.', 'Smith, W.', np.nan, np.nan, 'Jones T.', 'Williams, S.', 'Williams, S.', np.nan]
                  })

    System_Key  Manufacturer    Region       Department   Approver
0   MER-002     Mercedes        Americas     NaN          NaN
1   MER-003     Mercedes        Europe       Sales        Jones, T.
2   MER-004     Mercedes        Americas     NaN          NaN
3   MER-005     Mercedes        Asia         Operations   Jones, T.
4   BAV-378     BMW             Asia         NaN          Jones, T.
5   BAV-379     BMW             Europe       NaN          Smith, W.
6   BAV-380     BMW             Europe       Accounting   NaN
7   BAV-381     BMW             Europe       NaN          NaN
8   AUD-220     Audi            Americas     Finance      Jones T.
9   AUD-221     Audi            Asia         Finance      Williams, S.
10  AUD-222     Audi    Americas             Finance      Williams, S.
11  AUD-223     Audi    Americas             NaN          NaN

I then create a custom column:

df['Rebate_Plan'] = np.where(df['System_Key'].str.contains('BAV', na=False), 'Jupyter',
                    np.where(df['System_Key'].str.contains('AUD', na=False), 'Uranus',
                    np.where((df['System_Key'].str.contains('MER', na=False)) & (df['Approver'].str.contains('Jones', na=False)), 'Saturn',
                    np.where((df['System_Key'].str.contains('MER')) & (~df['Approver'].str.contains('Jones')), 'Pluto', '*No Plan*'))))

But, I get the following error message:

TypeError: bad operand type for unary ~: 'float'

Note: the error message I get on the production data is:

TypeError: ufunc 'invert' not supported for input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

I believe the invert function only handles integer and boolean types. So, I convert the data using:

df = df.all.applymap(str)

(which is discussed in Python - Turn all items in a Dataframe to strings)

Doing so resolves that error message and results in the following:

    System_Key  Manufacturer    Region      Department    Approver    Rebate_Plan
0   MER-002     Mercedes        Americas    nan           nan         Pluto
1   MER-003     Mercedes        Europe      Sales         Jones, T.   Saturn
2   MER-004     Mercedes        Americas    nan           Smith, W.   Pluto
3   MER-005     Mercedes        Asia        Operations    Jones, T.   Saturn
4   MER-006     Mercedes        Asia        nan           Jones, T.   Saturn
5   MER-007     Mercedes        Europe      nan           nan         Pluto
6   BAV-378     BMW             Europe      Accounting    nan         Jupyter
7   BAV-379     BMW             Asia        Finance       nan         Jupyter
8   BAV-380     BMW             Europe      Finance       nan         Jupyter
9   BAV-381     BMW             Europe      nan           nan         Jupyter
10  AUD-220     Audi            Americas    Finance       nan         Uranus
11  AUD-221     Audi            Asia        Finance       nan         Uranus
12  AUD-222     Audi            Americas    Finance       nan         Uranus
13  AUD-223     Audi            Americas    nan           nan         Uranus

Lastly, I create the pivot table:

pivot_data = pd.pivot_table(data=df, index='Manufacturer', columns='Approver', values='System_Key', aggfunc='count', margins=True, margins_name='TOTALS', fill_value=0)

pivot_data

Which results in:

Approver      Jones, T.   Smith, W.   nan   TOTALS
Manufacturer                
Audi          0           0           4     4
BMW           0           0           4     4
Mercedes      3           1           2     6
TOTALS        3           1          10    14

But, I would like to remove the "Audi" and "BMW" lines from the pivot_table (since they're blank and contain NaN values), such that the finished product looks as follows:

Approver      Jones, T.   Smith, W.  TOTALS
Manufacturer                
Mercedes      3           1          4      
TOTALS        3           1          4   

And, I can do this simply with the following:

pivot_data = pd.pivot_table(data=df, index='Manufacturer', columns='Approver', values='System_Key', aggfunc='count', margins=True, margins_name='TOTALS', fill_value=0)

BUT, THIS ONLY WORKS IF CALLED BEFORE ADDING THE DERIVED 'REBATE_PLAN' COLUMN and calling df = df.all.applymap(str).

Does anyone know of a way to work around this?

Thanks in advance!

equanimity
  • 2,371
  • 3
  • 29
  • 53

2 Answers2

1

I think you should use numpy.select to define your conditions. And then afterwards just use dropna() to get rid of the data you don't want. Here's my solution (I noticed your df is different than the example printout you showed btw.)

df = pd.DataFrame({'System_Key': ['MER-002', 'MER-003', 'MER-004', 'MER-005', 'BAV-378', 'BAV-379', 'BAV-380',
                                  'BAV-381', 'AUD-220', 'AUD-221', 'AUD-222', 'AUD-223'],
                   'Manufacturer': ['Mercedes', 'Mercedes', 'Mercedes', 'Mercedes', 'BMW', 'BMW', 'BMW', 'BMW', 'Audi',
                                    'Audi', 'Audi', 'Audi'],
                   'Region': ['Americas', 'Europe', 'Americas', 'Asia', 'Asia', 'Europe', 'Europe', 'Europe',
                              'Americas', 'Asia', 'Americas', 'Americas'],
                   'Department': [np.nan, 'Sales', np.nan, 'Operations', np.nan, np.nan, 'Accounting', np.nan,
                                  'Finance', 'Finance', 'Finance', np.nan],
                   'Approver': [np.nan, 'Jones, T.', np.nan, 'Jones, T.', 'Jones, T.', 'Smith, W.', np.nan, np.nan,
                                'Jones, T.', 'Williams, S.', 'Williams, S.', np.nan]
                   })

# create rebate plan column
in_col = lambda col, string: df[col].str.contains(string, na=False)
conds = {
    'Jupyter': in_col('System_Key', 'BAV'),
    'Uranus': in_col('System_Key', 'AUD'),
    'Saturn': in_col('System_Key', 'MER') & in_col('Approver', 'Jones'),
    'Pluto': in_col('System_Key', 'MER') & ~in_col('Approver', 'Jones')
}
df['Rebate_Plan'] = np.select(condlist=conds.values(), choicelist=conds.keys())

# drop na and pivot values
df_piv = df.dropna().pivot_table(index='Manufacturer', columns='Approver', values='System_Key',
                            aggfunc='count', margins=True, margins_name='TOTALS', fill_value=0)

teepee
  • 2,620
  • 2
  • 22
  • 47
  • I applied your solution (which is elegant!) to the production data and see two issues. Firstly, if I don't use `df = df.all.applymap(str)`, then I get incorrect results in the `Rebate_Plan` column. And, if I do use `df = df.all.applymap(str)`, then I get zero values ("0") where `Uranus` should be written. Any idea what could cause this? – equanimity Nov 20 '20 at 01:24
  • Hmm not sure. What do you mean by incorrect results in `Rebate_Plan` for example? – teepee Nov 20 '20 at 01:54
  • By incorrect results, I mean that "Pluto" is written when it should be "Saturn". It seems as though the "Saturn" case is completely disregarded. I'm wondering if it has to do with NaN values. – equanimity Nov 20 '20 at 02:08
  • I passed `df.replace(['None', 'nan'], np.nan, inplace=True)` immediately before creating the pivot table. It worked. – equanimity Nov 20 '20 at 16:35
0

The issue is resolved by passing df.replace(['None', 'nan'], np.nan, inplace=True) immediately before creating the pivot table.

equanimity
  • 2,371
  • 3
  • 29
  • 53