I have a very large dataframe, a sample of which looks like this:

df = pd.DataFrame({'From':['a','b','c','a','d'], 'To':['b', 'c', 'a', 'd', 'e'], 'Rates':[1e-4, 2.3e-2, 1e-2, 100, 70]})

In[121]: df
Out[121]: 
  From To     Rates
0    a  b    0.0001
1    b  c    0.0230
2    c  a    0.0100
3    a  d  100.0000
4    d  e   70.0000

The end result I would like is an n x n matrix model_matrix (where n is the number of distinct labels across From and To, 5 in this sample) that looks like this:

      a       b      c      d     e
a  0.00  0.0001  0.000  100.0   0.0
b  0.00  0.0000  0.023    0.0   0.0
c  0.01  0.0000  0.000    0.0   0.0
d  0.00  0.0000  0.000    0.0  70.0
e  0.00  0.0000  0.000    0.0   0.0

The following code works by converting df to a dictionary transfer_rates and filling in the model_matrix with values from the dictionary, but it is very inefficient for a large df.

from_comps = list(df['From'])
to_comps = list(df['To'])
transfer_rates = {}
for from_comp in from_comps:
    for to_comp in to_comps:
        try:
            transfer_rates[from_comp, to_comp] = df.loc[(df['From'] == from_comp) & (df['To'] == to_comp)]['Rates'].values[0]
        except IndexError:  # no row exists for this (from, to) pair
            pass

all_comps = sorted(set(from_comps+to_comps))

model_matrix = pd.DataFrame(columns=sorted(all_comps),index=sorted(all_comps))
for rate in transfer_rates:
    model_matrix.loc[rate[0], rate[1]] = transfer_rates[rate]  # row = From, column = To
model_matrix.fillna(0, inplace=True)

Is there a more efficient way of doing this?

Grzegorz Skibinski
DPdl

3 Answers

This works:

In [85]: df2 = df.pivot(index="From", columns="To", values="Rates")

In [86]: full_index = df2.index.union(df2.columns)

In [87]: df2 = df2.reindex(labels=full_index, axis=0).reindex(labels=full_index, axis=1).fillna(0.0)

In [88]: df2
Out[88]: 
      a       b      c      d     e
a  0.00  0.0001  0.000  100.0   0.0
b  0.00  0.0000  0.023    0.0   0.0
c  0.01  0.0000  0.000    0.0   0.0
d  0.00  0.0000  0.000    0.0  70.0
e  0.00  0.0000  0.000    0.0   0.0

Adapted from this question: How to make a rectangular matrix square on pandas dataframe
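One caveat, assuming the real data may repeat a (From, To) pair, which the sample in the question does not: `DataFrame.pivot` raises a `ValueError` when an index/column pair occurs more than once. A minimal sketch of one way around that uses `pivot_table` with an explicit aggregation. The frame `df_dup` is made up for illustration, and summing the duplicate rates is only an assumption about what the model needs.

```python
import pandas as pd

# Hypothetical data in which the pair (a, b) appears twice.
df_dup = pd.DataFrame({
    "From": ["a", "a", "b"],
    "To": ["b", "b", "c"],
    "Rates": [1.0, 2.0, 3.0],
})

# pivot_table aggregates duplicate pairs instead of raising.
# aggfunc="sum" is an assumption; "mean", "first", etc. may suit other models.
wide = df_dup.pivot_table(index="From", columns="To", values="Rates", aggfunc="sum")

# Same squaring step as above: use the union of row and column labels on both axes.
labels = wide.index.union(wide.columns)
square = wide.reindex(index=labels, columns=labels).fillna(0.0)
print(square)
```

If duplicates are not expected, keeping plain `pivot` may be preferable, since the error then flags bad input early.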

tuxdna

You can use pandas.pivot():

>>> df2 = df.pivot(index="From", columns="To", values="Rates")
>>> allopts=list("abcdefgh") #Define all possible options
>>> df3=pd.DataFrame(index=allopts, columns=allopts) #Create empty df to cover for empty rows/cols
>>> df4=pd.concat([df2,df3], sort=False)
>>> df4.groupby(df4.index).first().fillna(0)
      a       b      c      d     e    f    g    h
a  0.00  0.0001  0.000  100.0   0.0  0.0  0.0  0.0
b  0.00  0.0000  0.023    0.0   0.0  0.0  0.0  0.0
c  0.01  0.0000  0.000    0.0   0.0  0.0  0.0  0.0
d  0.00  0.0000  0.000    0.0  70.0  0.0  0.0  0.0
e  0.00  0.0000  0.000    0.0   0.0  0.0  0.0  0.0
f  0.00  0.0000  0.000    0.0   0.0  0.0  0.0  0.0
g  0.00  0.0000  0.000    0.0   0.0  0.0  0.0  0.0
h  0.00  0.0000  0.000    0.0   0.0  0.0  0.0  0.0
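When the full set of labels is known up front, as with `allopts` here, the same result can probably be reached without the `concat`/`groupby` step by reindexing both axes directly. This is a sketch under that assumption; the list of labels is the same illustrative `a` to `h` set used above.

```python
import pandas as pd

df = pd.DataFrame({
    "From": ["a", "b", "c", "a", "d"],
    "To": ["b", "c", "a", "d", "e"],
    "Rates": [1e-4, 2.3e-2, 1e-2, 100, 70],
})

allopts = list("abcdefgh")  # assumed complete set of labels

model_matrix = (
    df.pivot(index="From", columns="To", values="Rates")
      .reindex(index=allopts, columns=allopts)  # add missing rows and columns as NaN
      .fillna(0.0)                              # then fill every gap with zero
)
print(model_matrix)
```

Labels that appear in the data but not in `allopts` would be dropped silently by `reindex`, so the list has to cover everything in `From` and `To`.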

Ref:

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.pivot.html

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html

Grzegorz Skibinski
  • This was the solution I had, but it misses off the row `e`, which should be filled with `0` according to the question. – CDJB Jan 07 '20 at 16:52
  • Thank you. The issue is that I need a square matrix for further operations, and this does not give me the 'From' compartment where there is no corresponding 'To'. – DPdl Jan 07 '20 at 16:53

Use set_index, unstack and align:

df1 = df.set_index(['From','To']).Rates.unstack(fill_value=0)
df_final = df1.align(df1.T, fill_value=0)[0]

Out[586]:
      a       b      c      d     e
a  0.00  0.0001  0.000  100.0   0.0
b  0.00  0.0000  0.023    0.0   0.0
c  0.01  0.0000  0.000    0.0   0.0
d  0.00  0.0000  0.000    0.0  70.0
e  0.00  0.0000  0.000    0.0   0.0
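To see why the `align` step is needed, it may help to look at the intermediate shapes. In this sketch, which uses the sample frame from the question, `unstack` alone gives a 4 x 5 result because `e` never appears in `From`. Aligning with the transpose takes the union of labels on both axes, which makes the result square.

```python
import pandas as pd

df = pd.DataFrame({
    "From": ["a", "b", "c", "a", "d"],
    "To": ["b", "c", "a", "d", "e"],
    "Rates": [1e-4, 2.3e-2, 1e-2, 100, 70],
})

# Rows are the labels seen in From, columns the labels seen in To.
df1 = df.set_index(["From", "To"])["Rates"].unstack(fill_value=0)
print(df1.shape)   # (4, 5): there is no row for 'e'

# align() defaults to an outer join on both axes, so each side is
# reindexed to the union of row labels and the union of column labels.
left, right = df1.align(df1.T, fill_value=0)
print(left.shape)  # (5, 5)
```

Here `left` is the square matrix and `right` is its transpose, so only the first element of the returned tuple is kept.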
Andy L.