1

I have dataframe shaped like the following:

country institution department individual
USA Apple Marketing John Fowler
UK Apple Marketing Peter Pan
China Apple Finance John Fowler
Argentina Bosch Marketing Messi

I would like to create a weighted adjacency matrix that looked like the following:

USA UK China Argentina Apple Bosch Marketing Finance John Fowler Peter Pan Messi
USA 0 0 0 0 1 0 1 0 1 0 0
UK 0 0 0 0 1 0 1 0 0 1 0
China 0 0 0 0 1 0 0 1 1 0 0
Argentina 0 0 0 0 0 1 1 0 0 0 1
Apple 1 1 1 0 0 0 2 1 2 1 0
Bosch 0 0 0 1 0 0 1 0 0 0 1
Marketing 1 1 0 1 2 1 0 0 1 1 1
Finance 0 0 1 0 1 0 0 0 1 0 0
John Fowler 1 0 1 0 2 0 1 1 0 0 0
Peter Pan 0 1 0 0 1 0 1 0 0 0 0
Messi 0 0 0 1 0 1 1 0 0 0 0

I have seen examples here and here but I could not extend the solutions to more than 2 columns.

Beavis
  • 94
  • 9

2 Answers2

3

You can use a generator with itertools.permutations and pandas.crosstab:

from itertools import permutations

out = (pd.crosstab(*zip(*((a, b) for l in df.to_numpy().tolist()
                          for a,b in permutations(l, r=2))))
      ).rename_axis(index=None, columns=None)

Output:

             Apple  Argentina  Bosch  China  Finance  John Fowler  Marketing  Messi  Peter Pan  UK  USA
Apple            0          0      0      1        1            2          2      0          1   1    1
Argentina        0          0      1      0        0            0          1      1          0   0    0
Bosch            0          1      0      0        0            0          1      1          0   0    0
China            1          0      0      0        1            1          0      0          0   0    0
Finance          1          0      0      1        0            1          0      0          0   0    0
John Fowler      2          0      0      1        1            0          1      0          0   0    1
Marketing        2          1      1      0        0            1          0      1          1   1    1
Messi            0          1      1      0        0            0          1      0          0   0    0
Peter Pan        1          0      0      0        0            0          1      0          0   1    0
UK               1          0      0      0        0            0          1      0          1   0    0
USA              1          0      0      0        0            1          1      0          0   0    0

How it works:

  • for each row, generate all combinations of 2 items [(a,b), (b,a), ...]
  • reshape into 2 arrays [(a,b,...),(b,a,...)]
  • pass the arrays to crosstab expanding as the first two parameters
corresponding graph:

enter image description here

mozway
  • 194,879
  • 13
  • 39
  • 75
  • Could you explain a little how this works? I just don't understand the logic but it seems very elegant. – JarroVGIT Feb 12 '23 at 17:36
  • 1
    @JarroVGIT sure, I'll edit the answer – mozway Feb 12 '23 at 17:45
  • @JarroVGIT, agree with you, this solution looks really elegant. – Beavis Feb 12 '23 at 18:01
  • @mozway, I have an error message: "If using all scalar values, you must pass an index." Could you shed light on how I can troubleshoot this? – Beavis Feb 12 '23 at 18:02
  • @Beavis can you provide a reproducible input that triggers the error? – mozway Feb 12 '23 at 18:03
  • @mozway I am learning: could you show me an example of a reproducible input? – Beavis Feb 12 '23 at 18:04
  • @Beavis the reproducible input of the question (which doesn't trigger an error for me): `df = pd.DataFrame({'country': ['USA', 'UK', 'China', 'Argentina'], 'institution': ['Apple', 'Apple', 'Apple', 'Bosch'], 'department': ['Marketing', 'Marketing', 'Finance', 'Marketing'], 'individual': ['John Fowler', 'Peter Pan', 'John Fowler', 'Messi']})` – mozway Feb 12 '23 at 18:12
  • I edited the answer to use permutations, which is better – mozway Feb 12 '23 at 18:15
  • @mozway while I was trying the new solution, the error 'repeat' is an invalid keyword argument for permutations(). Is there something else I should be importing? – Beavis Feb 12 '23 at 18:25
  • @Beavis my fault, I replaced `product` with `permutations`, they (unfortunately) use a different keyword for the repetitions. This is fixed now ;) – mozway Feb 12 '23 at 18:27
  • I think my problem is with my dataframe, which I brought in from a csv. I still have this "If you are using all scalar values, you must pass an index." My data frame did not come from a list, but from columns imported from csv. Any clues on how to resolve this? – Beavis Feb 12 '23 at 18:58
  • Hard to tell without the data, code and debug info. Is this error triggered before my code? – mozway Feb 12 '23 at 19:04
  • @mozway I used this trial data (which is a csv file) https://www.dropbox.com/s/gmaaf3kvtgkp1gm/trial_data.csv?dl=0 And then: `import pandas as pd` `import numpy as np` `df = pd.read_csv("trial_data.csv")` `from itertools import permutations` `out = (pd.crosstab(*zip(*((a, b) for l in df.to_numpy().tolist() for a,b in permutations(l, r=2)))) ).rename_axis(index=None,columns=None)` – Beavis Feb 12 '23 at 19:40
  • @Beavis I just tried with your file and this worked without issues, what are your python and pandas versions? Can you make sure to use recent versions? – mozway Feb 13 '23 at 06:09
  • 1
    You are right! I've never had version problems till now. You're a genius. Thanks for all the help. – Beavis Feb 13 '23 at 17:42
1

You can use sliding_window_view and networkx to get a weighted adjacency matrix:

from numpy.lib.stride_tricks import sliding_window_view
import networkx as nx

edges = sliding_window_view(df1.values, (len(df1), 2)).reshape(-1, 2)
edges = pd.DataFrame(edges, columns=['source', 'target']).value_counts().reset_index()

G = nx.from_pandas_edgelist(edges, edge_attr='count')
adjmx = nx.to_pandas_adjacency(G, weight='count').astype(int)

Output:

>>> adjmx
             Apple  Marketing  Finance  Argentina  Bosch  China  John Fowler  Messi  Peter Pan  UK  USA
Apple            0          2        1          0      0      1            0      0          0   1    1
Marketing        2          0        0          0      1      0            1      1          1   0    0
Finance          1          0        0          0      0      0            1      0          0   0    0
Argentina        0          0        0          0      1      0            0      0          0   0    0
Bosch            0          1        0          1      0      0            0      0          0   0    0
China            1          0        0          0      0      0            0      0          0   0    0
John Fowler      0          1        1          0      0      0            0      0          0   0    0
Messi            0          1        0          0      0      0            0      0          0   0    0
Peter Pan        0          1        0          0      0      0            0      0          0   0    0
UK               1          0        0          0      0      0            0      0          0   0    0
USA              1          0        0          0      0      0            0      0          0   0    0
Corralien
  • 109,409
  • 8
  • 28
  • 52
  • could you edit your solution? The matrix is unbalanced; individuals are left out of the rows. This looks like a really elegant solution! – Beavis Feb 12 '23 at 17:59
  • A sliding window view will only show adjacent pairs, not all combinations, this gives an incorrect output :( – mozway Feb 12 '23 at 18:34
  • @Beavis. Are you sure you are looking for a (weighted) adjacency matrix? – Corralien Feb 12 '23 at 18:40
  • @Corralien, yes, a weighted matrix. The output give me rows with repeated row headings. – Beavis Feb 12 '23 at 18:49
  • How USA could be connected to Marketing if you want an adjacency matrix? – Corralien Feb 12 '23 at 19:16
  • `networkx` can return a real weighted adjacency matrix. Can you check the output, you will see it's very different of what you expect. – Corralien Feb 12 '23 at 19:26
  • @Corralien, the code returned an error. `G = nx.from_pandas_edgelist(edges, edge_attr='count')` I changed it to: `G = nx.from_pandas_edgelist(edges, edge_attr=None)` It ran and it returned a balanced matrix, but unweighted. – Beavis Feb 12 '23 at 19:57
  • Try with `G = nx.from_pandas_edgelist(edges, edge_attr=True)`. You need `edge_attr` to use `count` column as edge attribute to get the weight. Don't forget `.value_counts().reset_index()`? – Corralien Feb 12 '23 at 20:02
  • Still unweighted :( – Beavis Feb 12 '23 at 20:11
  • What is your version of `networkx`? `print(nx.__version__)` – Corralien Feb 12 '23 at 20:11
  • The version is 2.6.3. – Beavis Feb 12 '23 at 20:14
  • I tested with your version and it works perfectly. Does `edges` dataframe is correct: 3 columns: source, target and count columns? – Corralien Feb 12 '23 at 20:25
  • I tried your approach again with updated python libraries, but I am still having some issues as it produces a non-weighted matrix. Updated my python to the latest version too. Not sure what's up, but thanks for introducing this method - which gets me inducted into networkx. Cheers – Beavis Feb 14 '23 at 06:17
  • I hope you will solve this problem. Remember that your expected output is not what is called "a weighted adjacency matrix" :-) – Corralien Feb 14 '23 at 07:26