1

I am struggling to get something like a matrix dataframe in pandas/NumPy.

I have the following dataframe

Value Source Placement First
A GooFeed 1 GooFeed
A CyberCyber 2 GooFeed
A GooFeed 3 GooFeed
B CyberCyber 1 CyberCyber
B GooFeed 2 CyberCyber
C WowFeed 1 WowFeed
C WowFeed 2 WowFeed
C CyberCyber 3 WowFeed
C GooFeed 4 WowFeed
C CyberCyber 5 WowFeed
D GooFeed 1 GooFeed

What I want to achieve is a percentage of the feeds having data from each other. The feeds with the 1 in Placement got the data (value A) first. Sometimes the same values appear in multiple data feeds and so are multiple placements from the same feed at one value, like with value C and CyberCyber. But the values should be unique and not count as 2 times data from WowFeed.

I calculated total values in a separate dataframe like the following:

Source Firsts TotalDistinctValues FirstSources
GooFeed 2 4 {'CyberCyber':1;'WowFeed':1}
CyberCyber 1 3 {'GooFeed':1;'WowFeed':1}
WowFeed 1 1

And the matrix I am looking for is the following:

source GooFeed CyberCyber WowFeed Total
GooFeed 25% 25% 50%
CyberCyber 33% 33% 66%
WowFeed
Average 33% 25% 58%

Average (bottom row) of the column values excluding the value source matching itself. Like GooFeed at GooFeed.

Total (outer row on the right) should be to total of its values from its own distinct values.

I really do not get my hand around how this should work. Am I even right to try it in pandas/NumPy?

I am so glad for every help.

Best regards

Maurice

Maurice F
  • 25
  • 5

2 Answers2

1

I managed to work with piterbargs to answer and achieve everything I need. Thank you piterbarg!

For the ones interested I will post the code because I edited it a bit (caution - different naming of column / values).

enter image description here

enter image description here

Maurice F
  • 25
  • 5
0

Need a few steps here. In the first step we drop duplicates to avoid double-counting and pivot:

df2 = pd.pivot_table(df.drop_duplicates(['Value','Source','First']), index = 'Source', columns = 'First',values = 'Value',aggfunc = 'count', fill_value = 0)
df2

we get

First       CyberCyber  GooFeed WowFeed
Source          
CyberCyber  1           1       1
GooFeed     1           2       1
WowFeed     0           0       1

Now we need to translate into proportions and calculate Totals. The somewhat tricky bit is to get rid of the diagonal (this is where np.eye() comes in)

import numpy as np
df3 = df2.div( df2.sum(axis=1), axis=0 )
df4 = df3*(1 - np.eye(len(df3))
df4['Total'] = df4.sum(axis=1)
df4.loc['Total'] = df4.sum(axis=0)
df4

and we get


First       CyberCyber  GooFeed     WowFeed     Total
Source              
CyberCyber  0.00        0.333333    0.333333    0.666667
GooFeed     0.25        0.000000    0.250000    0.500000
WowFeed     0.00        0.000000    0.000000    0.000000
Total       0.25        0.333333    0.583333    1.166667

if you want you can replace 0's with NaNs, say, using df4.replace(0,np.NaN) and manually set the bottom right cell to NaN if that's of consequence

piterbarg
  • 8,089
  • 2
  • 6
  • 22
  • One word: Awesome! That was very helpfull and well explained, thank you. I managed to understand what you did and dropping the duplicates right at the start helped me with many more calculations. – Maurice F Mar 28 '21 at 10:05
  • @MauriceF Glad it helped! Consider marking the answer as accepted per [guidelines](https://stackoverflow.com/help/someone-answers) – piterbarg Mar 28 '21 at 10:07
  • I have just some more questions Total values The Total row (bottom) in the percentages makes no sense how I asked it to be. The values do not make sense and should be the average of the column including the 0 % and excluding the field in which the source matches its own column. Right? This should give the result: The percentage of all sources are coming from this source. – Maurice F Mar 28 '21 at 10:13
  • Question 2 It just saw, that something is wrong with the percentages. They are calculated even without values in this column at all. I will add a picture in another answer. – Maurice F Mar 28 '21 at 10:14
  • @MauriceF if you question is not quite right, you should edit your question rather than add the info as an answer – piterbarg Mar 28 '21 at 10:16
  • Thank you for the hint. I changed it in the original post. – Maurice F Mar 28 '21 at 10:20
  • For the average I changed the sum function to mean and it should work now. Pitersbarg should I mark your answer as solving and open a new question for the second part? Because I can not update the main post anymore. – Maurice F Mar 28 '21 at 10:42
  • @Maurice I think opening a new question is a good idea. Just make sure you have a [minimal example](https://stackoverflow.com/help/minimal-reproducible-example) and use code not pictures for your dataframes: [pandas](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) – piterbarg Mar 28 '21 at 13:09