2

I have two dataframes:

df1 - is a pivot table that has totals for both columns and rows, both with default names "All" df2 - a df I created manually by specifying values and using the same index and column names as are used in the pivot table above. This table does not have totals.

I need to multiply the first dataframe by the values in the second. I expect the totals return NaNs since totals don't exist in the second table.

When I perform multiplication, I get the following error:

ValueError: cannot join with no level specified and no overlapping names

When I try the same on dummy dataframes it works as expected:

import pandas as pd
import numpy as np
table1 = np.matrix([[10, 20, 30, 60],
                  [50, 60, 70, 180],
                  [90, 10, 10, 110],
                  [150, 90, 110, 350]])
df1 = pd.DataFrame(data = table1, index = ['One','Two','Three', 'All'], columns =['A', 'B','C', 'All'] )
print(df1)

table2 = np.matrix([[1.0, 2.0, 3.0],
                  [5.0, 6.0, 7.0],
                  [2.0, 1.0, 5.0]])
df2 = pd.DataFrame(data = table2, index = ['One','Two','Three'], columns =['A', 'B','C'] )
print(df2)

df3 = df1*df2
print(df3)

This gives me the following output:

         A   B    C  All
One     10  20   30   60
Two     50  60   70  180
Three   90  10   10  110
All    150  90  110  350

         A    B    C
One   1.00 2.00 3.00
Two   5.00 6.00 7.00
Three 2.00 1.00 5.00

           A  All      B      C
All      nan  nan    nan    nan
One    10.00  nan  40.00  90.00
Three 180.00  nan  10.00  50.00
Two   250.00  nan 360.00 490.00

So, visually, the only difference between df1 and df2 is the presence/absence of the column and row "All".

And I think the only difference between my dummy dataframes and the real ones is that the real df1 was created with pd.pivot_table method:

df1_real = pd.pivot_table(PY, values = ['Annual Pay'], index = ['PAR Rating'],
          columns = ['CR Range'], aggfunc = [np.sum], margins = True)

I do need to keep the total as I'm using them in other calculations.

I'm sure there is a workaround but I just really want to understand why the same code works on some dataframes of different sizes but not others. Or maybe an issue is something completely different.

Thank you for reading. I realize it's a very long post..

Tatiana
  • 85
  • 3
  • 10

3 Answers3

2

IIUC,

My Preferred Approach
you can use the mul method in order to pass the fill_value argument. In this case, you'll want a value of 1 (multiplicative identity) to preserve the value from the dataframe in which the value is not missing.

df1.mul(df2, fill_value=1)

           A    All      B      C
All    150.0  350.0   90.0  110.0
One     10.0   60.0   40.0   90.0
Three  180.0  110.0   10.0   50.0
Two    250.0  180.0  360.0  490.0

Alternate Approach
You can also embrace the np.nan and use a follow-up combine_first to fill back in the missing bits from df1

(df1 * df2).combine_first(df1)

           A    All      B      C
All    150.0  350.0   90.0  110.0
One     10.0   60.0   40.0   90.0
Three  180.0  110.0   10.0   50.0
Two    250.0  180.0  360.0  490.0
piRSquared
  • 285,575
  • 57
  • 475
  • 624
  • Knowing how to add values from the first df in place of NaNs will definitely come in handy. What I'm trying to figure out it, though, is why I can not multiply my two real world dataframes at all. I get an error: ValueError: cannot join with no level specified and no overlapping names. And when I try the method df1.mul(df2, fill_value=1) I get an error: TypeError: f() got multiple values for argument 'fill_value' – Tatiana Feb 02 '18 at 22:11
  • @Tatiana asking questions about your real world data can be difficult. But learning how to reduce your real world data to a minimal representation that still presents the problem is a skill that is worth practicing. Do your best to show us a minimal sample of your real world data in which the problem you're expressing shows itself. – piRSquared Feb 02 '18 at 22:18
1

I really like Pir 's approach , and here is mine :-)

df1.loc[df2.index,df2.columns]*=df2
df1
Out[293]: 
           A      B      C  All
One     10.0   40.0   90.0   60
Two    250.0  360.0  490.0  180
Three  180.0   10.0   50.0  110
All    150.0   90.0  110.0  350
BENY
  • 317,841
  • 20
  • 164
  • 234
  • This seems such an elegant solution - it even preserved the layout! I'm still unable to multiply my real world dataframes, however. The error I'm getting with this approach is: `KeyError: "None of [Index(['CR Range', 'CR Below 85', 'CR 85-95', 'CR 95-105', 'CR 105-115',\n 'CR 115-122', 'CR Above 122'],\n dtype='object')] are in the [columns]"` These are my bins/categories for the original data that was aggregated by pivot_table method. Visually, they are definitely there but it seems as though python does not see them.. – Tatiana Feb 02 '18 at 22:35
  • @Tatiana Maybe slightly modify your sample data to you real data format ? – BENY Feb 02 '18 at 23:24
0

@Wen, @piRSquared, thank you for your help. This is what I ended up doing. There is probably a more elegant solution but this worked for me.

Since I was able to multiply two dummy dataframes of different sizes, I reasoned the issue wasn't the size, but the fact that one of the dataframes was created as a pivot table. Somehow in this pivot table, the headers were not recognized, though visually they were there. So, I decided to convert the pivot table to a regular dataframe. Steps I took:

  1. Converted the pivot table to records and then back to dataframe using solution from this thread: pandas pivot table to data frame .
  2. Cleaned up the column headers using solution from the same thread above: pandas pivot table to data frame .

  3. Set my first column as the index following suggestion in this thread: How to remove index from a created Dataframe in Python?

This gave me a dataframe that was visually identical to what I had before but was no longer a pivot table.

I was then able to multiply the two dataframes with no issues. I used approach suggested by @Wen because I like that it preserves the structure.

Tatiana
  • 85
  • 3
  • 10