0

Good day,

Problem: I have two data frames - performance per a firm aka output and input per a firm:

`firms = ['1', '2', '3']
df = pd.DataFrame(firms)
output = { 'firms': ['1', '2', '3'],
'Sales': [150, 200, 50],
'Profit':[200, 210, 90]}
df1 = pd.DataFrame.from_dict(output)
inputs = { 'firms': ['1', '2', '3'],
'Salary': [10000, 20000, 500],
'employees':[2, 4, 5]}
df2 = pd.DataFrame.from_dict(inputs)`

What I need is to divide every column from the output table to every column in the input table. As of now I am doing it in a very ugly manner - by dividing the entire output tbl by every individual column in the input table and then merging the result together. It's all good when I have two columns, but I wonder if there is a better way to do it as I might have 100 columns in one table and 50 in another. Ah, it's also important that the size might be different, e.g. 50 cols in the input and 100 in the output table.

frst = df1.iloc[:,0:2].divide(df2.Salary, axis = 0)
frst.columns = ['y1-x1', 'y2-x1']
sec = df1.iloc[:,0:2].divide(df2.employees, axis = 0)
sec.columns = ['y1-x2', 'y2-x2']
complete = pd.DataFrame(df).join(frst).join(sec)

Output:

| Firm | y1-x1 | y2-x1 | y1-x2 | y2-x2 |

| 1 | 0.0200 | 0.015 | 100.0 | 75.0 |

| 2 | 0.0105 | 0.010 | 52.5 | 50.0 |

| 3 | 0.1800 | 0.100 | 18.0 | 10.0 |

I also tried with loops but if I remember correctly because in my actual example, I have tables of different size, it did not work out. I will be very grateful for your suggestions!

eponkratova
  • 467
  • 7
  • 20
  • Can you reformat your dataframe (for example: df1 and df2) and also add the desired output you are looking for. – Karn Kumar Oct 04 '18 at 17:54
  • for this to be meaningful doesn't the length of input and output need to be the same? otherwise won't you be dividing by arbitrary values not relating to the relevant company? – Sven Harris Oct 04 '18 at 18:00
  • @pygo, I adjusted the code and added a sort of the desired output table - somehow markdown did not work here but I hope I delivered the message. – eponkratova Oct 04 '18 at 18:14
  • No, @Sven, it's not necessary that the # of the input variables equal to the number of the output vars. The list of firms is the same for both tables if you were asking about it. – eponkratova Oct 04 '18 at 18:19

2 Answers2

0

So I think the issue is that you are treating your data as essentially three-dimensional, where you have dimensions (firms, components of costs, components of income), and you want ratios for each of the outer product of the three dimensions.

There are certainly ways to accomplish what you'd like to do in a DataFrame, but they're messsy.

Pandas does have a 3-D object called a Panel, but this is being deprecated in favor of a more complete solution for indexed higher-dimensional data structures called xarray. Think of it as pandas for NDArrays.

We can convert your data into an xarray DataArray by labeling and stacking your indices:

In [2]: income = df1.set_index('firms').rename_axis(['income'], axis=1).stack('income').to_xarray()

In [3]: income
Out[3]:
<xarray.DataArray (firms: 3, income: 2)>
array([[150, 200],
       [200, 210],
       [ 50,  90]])
Coordinates:
  * firms    (firms) object '1' '2' '3'
  * income   (income) object 'Sales' 'Profit'

In [4]: costs = df2.set_index('firms').rename_axis(['costs'], axis=1).stack('costs').to_xarray()

In [5]: costs
Out[5]:
<xarray.DataArray (firms: 3, costs: 2)>
array([[10000,     2],
       [20000,     4],
       [  500,     5]])
Coordinates:
  * firms    (firms) object '1' '2' '3'
  * costs    (costs) object 'Salary' 'employees'

You now have two DataArrays, each with two dimensions, but the dimensions do not match. Both are indexed by firms, but income is indexed by income and costs is indexed by costs.

These are broadcast against each other automatically when operations are performed against both of them:

In [6]: income / costs
Out[6]:
<xarray.DataArray (firms: 3, income: 2, costs: 2)>
array([[[1.50e-02, 7.50e+01],
        [2.00e-02, 1.00e+02]],

       [[1.00e-02, 5.00e+01],
        [1.05e-02, 5.25e+01]],

       [[1.00e-01, 1.00e+01],
        [1.80e-01, 1.80e+01]]])
Coordinates:
  * firms    (firms) object '1' '2' '3'
  * income   (income) object 'Sales' 'Profit'
  * costs    (costs) object 'Salary' 'employees'

This data now has the structure you're trying to achieve, and this division was done using optimized cython operations rather than loops.

You can convert the data back to a dataframe using the built in DataArray.to_series method:

In [7]: (income / costs).to_series().to_frame(name='income to cost ratio')
Out[7]:
                        income to cost ratio
firms income costs
1     Sales  Salary                   0.0150
             employees               75.0000
      Profit Salary                   0.0200
             employees              100.0000
2     Sales  Salary                   0.0100
             employees               50.0000
      Profit Salary                   0.0105
             employees               52.5000
3     Sales  Salary                   0.1000
             employees               10.0000
      Profit Salary                   0.1800
             employees               18.0000
Michael Delgado
  • 13,789
  • 3
  • 29
  • 54
  • just saw your updated output specification. If you'd like the data in that format, you coule replace the `to_frame` call with `unstack(['income', 'costs'])` – Michael Delgado Oct 04 '18 at 18:47
  • It's very beautiful, @delgadom! I just re-named the columns and changed the type: `df = (income / costs).to_series().unstack(['income', 'costs']).astype(float) df.columns = ['profit-salary', 'profit-emp', 'sales-sal', 'sales-emp' ]`. Speaking of the type, you have floats in your answer but in my case, it has been showing ints, but if I change to float, I don't see the correct decimals, e.g. 0.1000 is 0.00. – eponkratova Oct 04 '18 at 19:42
  • I imagine you're probably using python 2.7? This is an integer division problem. If you convert your original data to floats (e.g. `df1.astype(float)`, `df2.astype(float)`) this will work. – Michael Delgado Oct 04 '18 at 19:50
  • see this answer on integer division in python: https://stackoverflow.com/questions/2958684/python-division/2958717#2958717 – Michael Delgado Oct 04 '18 at 19:52
  • Interestingly, I used Python 3. And even with me adding astype(float), did not help. I don't see the correct decimals, e.g. 0.1000 is 0.00. – eponkratova Oct 04 '18 at 19:56
  • Huh. xarray is a core pydata library that is used and tested by a huge user community so its handling of math operators is definitely not the issue. are you sure you followed my code exactly? I just ran this again and it worked just fine for me. If you would like help figuring out how to do this in xarray and why the behavior isn't what you expect feel free to post your code as another question! – Michael Delgado Oct 04 '18 at 21:30
0

I don't see why you can't just use a simple loop. It seems like you want to align everything on firms so setting that to the index will resolve any joins or divisions by unequal lengths.

df1 = df1.set_index('firms')
df2 = df2.set_index('firms')

l = []
for col in df2.columns:
    l.append(df1.div(df2[col], axis=0).add_suffix(f'_by_{col}'))
pd.concat(l, axis=1)

Output:

       Sales_by_Salary  Profit_by_Salary  Sales_by_employees  Profit_by_employees
firms                                                                            
1                0.015            0.0200                75.0                100.0
2                0.010            0.0105                50.0                 52.5
3                0.100            0.1800                10.0                 18.0
ALollz
  • 57,915
  • 7
  • 66
  • 89