create adjacency matrix from unique values from multiple columns

Question

I have dataframe shaped like the following:

country	institution	department	individual
USA	Apple	Marketing	John Fowler
UK	Apple	Marketing	Peter Pan
China	Apple	Finance	John Fowler
Argentina	Bosch	Marketing	Messi

I would like to create a weighted adjacency matrix that looked like the following:

	USA	UK	China	Argentina	Apple	Bosch	Marketing	Finance	John Fowler	Peter Pan	Messi
USA	0	0	0	0	1	0	1	0	1	0	0
UK	0	0	0	0	1	0	1	0	0	1	0
China	0	0	0	0	1	0	0	1	1	0	0
Argentina	0	0	0	0	0	1	1	0	0	0	1
Apple	1	1	1	0	0	0	2	1	2	1	0
Bosch	0	0	0	1	0	0	1	0	0	0	1
Marketing	1	1	0	1	2	1	0	0	1	1	1
Finance	0	0	1	0	1	0	0	0	1	0	0
John Fowler	1	0	1	0	2	0	1	1	0	0	0
Peter Pan	0	1	0	0	1	0	1	0	0	0	0
Messi	0	0	0	1	0	1	1	0	0	0	0

I have seen examples here and here but I could not extend the solutions to more than 2 columns.

mozway · Accepted Answer · 2023-02-13T06:21:56.827

3

You can use a generator with itertools.permutations and pandas.crosstab:

from itertools import permutations

out = (pd.crosstab(*zip(*((a, b) for l in df.to_numpy().tolist()
                          for a,b in permutations(l, r=2))))
      ).rename_axis(index=None, columns=None)

Output:

             Apple  Argentina  Bosch  China  Finance  John Fowler  Marketing  Messi  Peter Pan  UK  USA
Apple            0          0      0      1        1            2          2      0          1   1    1
Argentina        0          0      1      0        0            0          1      1          0   0    0
Bosch            0          1      0      0        0            0          1      1          0   0    0
China            1          0      0      0        1            1          0      0          0   0    0
Finance          1          0      0      1        0            1          0      0          0   0    0
John Fowler      2          0      0      1        1            0          1      0          0   0    1
Marketing        2          1      1      0        0            1          0      1          1   1    1
Messi            0          1      1      0        0            0          1      0          0   0    0
Peter Pan        1          0      0      0        0            0          1      0          0   1    0
UK               1          0      0      0        0            0          1      0          1   0    0
USA              1          0      0      0        0            1          1      0          0   0    0

How it works:

for each row, generate all combinations of 2 items [(a,b), (b,a), ...]
reshape into 2 arrays [(a,b,...),(b,a,...)]
pass the arrays to crosstab expanding as the first two parameters

corresponding graph:

edited Feb 13 '23 at 06:21

answered Feb 12 '23 at 17:22

mozway

194,879
13
39
75

Could you explain a little how this works? I just don't understand the logic but it seems very elegant. – JarroVGIT Feb 12 '23 at 17:36
1

@JarroVGIT sure, I'll edit the answer – mozway Feb 12 '23 at 17:45
@JarroVGIT, agree with you, this solution looks really elegant. – Beavis Feb 12 '23 at 18:01
@mozway, I have an error message: "If using all scalar values, you must pass an index." Could you shed light on how I can troubleshoot this? – Beavis Feb 12 '23 at 18:02
@Beavis can you provide a reproducible input that triggers the error? – mozway Feb 12 '23 at 18:03
@mozway I am learning: could you show me an example of a reproducible input? – Beavis Feb 12 '23 at 18:04
@Beavis the reproducible input of the question (which doesn't trigger an error for me): `df = pd.DataFrame({'country': ['USA', 'UK', 'China', 'Argentina'], 'institution': ['Apple', 'Apple', 'Apple', 'Bosch'], 'department': ['Marketing', 'Marketing', 'Finance', 'Marketing'], 'individual': ['John Fowler', 'Peter Pan', 'John Fowler', 'Messi']})` – mozway Feb 12 '23 at 18:12
I edited the answer to use permutations, which is better – mozway Feb 12 '23 at 18:15
@mozway while I was trying the new solution, the error 'repeat' is an invalid keyword argument for permutations(). Is there something else I should be importing? – Beavis Feb 12 '23 at 18:25
@Beavis my fault, I replaced `product` with `permutations`, they (unfortunately) use a different keyword for the repetitions. This is fixed now ;) – mozway Feb 12 '23 at 18:27
I think my problem is with my dataframe, which I brought in from a csv. I still have this "If you are using all scalar values, you must pass an index." My data frame did not come from a list, but from columns imported from csv. Any clues on how to resolve this? – Beavis Feb 12 '23 at 18:58
Hard to tell without the data, code and debug info. Is this error triggered before my code? – mozway Feb 12 '23 at 19:04
@mozway I used this trial data (which is a csv file) https://www.dropbox.com/s/gmaaf3kvtgkp1gm/trial_data.csv?dl=0 And then: `import pandas as pd` `import numpy as np` `df = pd.read_csv("trial_data.csv")` `from itertools import permutations` `out = (pd.crosstab(*zip(*((a, b) for l in df.to_numpy().tolist() for a,b in permutations(l, r=2)))) ).rename_axis(index=None,columns=None)` – Beavis Feb 12 '23 at 19:40
@Beavis I just tried with your file and this worked without issues, what are your python and pandas versions? Can you make sure to use recent versions? – mozway Feb 13 '23 at 06:09
1

You are right! I've never had version problems till now. You're a genius. Thanks for all the help. – Beavis Feb 13 '23 at 17:42

Corralien · Answer 2 · 2023-02-12T19:24:13.583

1

You can use sliding_window_view and networkx to get a weighted adjacency matrix:

from numpy.lib.stride_tricks import sliding_window_view
import networkx as nx

edges = sliding_window_view(df1.values, (len(df1), 2)).reshape(-1, 2)
edges = pd.DataFrame(edges, columns=['source', 'target']).value_counts().reset_index()

G = nx.from_pandas_edgelist(edges, edge_attr='count')
adjmx = nx.to_pandas_adjacency(G, weight='count').astype(int)

Output:

>>> adjmx
             Apple  Marketing  Finance  Argentina  Bosch  China  John Fowler  Messi  Peter Pan  UK  USA
Apple            0          2        1          0      0      1            0      0          0   1    1
Marketing        2          0        0          0      1      0            1      1          1   0    0
Finance          1          0        0          0      0      0            1      0          0   0    0
Argentina        0          0        0          0      1      0            0      0          0   0    0
Bosch            0          1        0          1      0      0            0      0          0   0    0
China            1          0        0          0      0      0            0      0          0   0    0
John Fowler      0          1        1          0      0      0            0      0          0   0    0
Messi            0          1        0          0      0      0            0      0          0   0    0
Peter Pan        0          1        0          0      0      0            0      0          0   0    0
UK               1          0        0          0      0      0            0      0          0   0    0
USA              1          0        0          0      0      0            0      0          0   0    0

edited Feb 12 '23 at 19:24

answered Feb 12 '23 at 17:46

Corralien

109,409
8
28
52

could you edit your solution? The matrix is unbalanced; individuals are left out of the rows. This looks like a really elegant solution! – Beavis Feb 12 '23 at 17:59
A sliding window view will only show adjacent pairs, not all combinations, this gives an incorrect output :( – mozway Feb 12 '23 at 18:34
@Beavis. Are you sure you are looking for a (weighted) adjacency matrix? – Corralien Feb 12 '23 at 18:40
@Corralien, yes, a weighted matrix. The output give me rows with repeated row headings. – Beavis Feb 12 '23 at 18:49
How USA could be connected to Marketing if you want an adjacency matrix? – Corralien Feb 12 '23 at 19:16
`networkx` can return a real weighted adjacency matrix. Can you check the output, you will see it's very different of what you expect. – Corralien Feb 12 '23 at 19:26
@Corralien, the code returned an error. `G = nx.from_pandas_edgelist(edges, edge_attr='count')` I changed it to: `G = nx.from_pandas_edgelist(edges, edge_attr=None)` It ran and it returned a balanced matrix, but unweighted. – Beavis Feb 12 '23 at 19:57
Try with `G = nx.from_pandas_edgelist(edges, edge_attr=True)`. You need `edge_attr` to use `count` column as edge attribute to get the weight. Don't forget `.value_counts().reset_index()`? – Corralien Feb 12 '23 at 20:02
Still unweighted :( – Beavis Feb 12 '23 at 20:11
What is your version of `networkx`? `print(nx.__version__)` – Corralien Feb 12 '23 at 20:11
The version is 2.6.3. – Beavis Feb 12 '23 at 20:14
I tested with your version and it works perfectly. Does `edges` dataframe is correct: 3 columns: source, target and count columns? – Corralien Feb 12 '23 at 20:25
I tried your approach again with updated python libraries, but I am still having some issues as it produces a non-weighted matrix. Updated my python to the latest version too. Not sure what's up, but thanks for introducing this method - which gets me inducted into networkx. Cheers – Beavis Feb 14 '23 at 06:17
I hope you will solve this problem. Remember that your expected output is not what is called "a weighted adjacency matrix" :-) – Corralien Feb 14 '23 at 07:26

create adjacency matrix from unique values from multiple columns

2 Answers2

corresponding graph: