Convert long-form dataframe of pairwise distances to distance matrix in python

Question

I have a pandas dataframe of pairwise distances in the form of:

    SampleA   SampleB  Num_Differences
0  sample_1  sample_2                1
1  sample_1  sample_3                4
2  sample_2  sample_3                8

Note that there are no self-self comparisons (e.g., sample_1 vs sample_1 won't be represented). I would like to convert this table into a squareform distance matrix instead, like so:

            sample_1      sample_2  sample_3
sample_1                       1              4
sample_2         1                            8
sample_3         4             8

Can anyone give me some pointers on how to do such a conversion in python? The problem is analogous to a previous question in R (Converting pairwise distances into a distance matrix in R), but I don't know the corresponding python functions to use. The problem also appears to be the opposite of this question (Convert a distance matrix to a list of pairwise distances in Python).

Some code to reproduce a dataframe in the form I'm using:

df = pd.DataFrame([['sample_1', 'sample_2', 1],
                   ['sample_1', 'sample_3', 4],
                   ['sample_2', 'sample_3', 8]],
                  columns=['SampleA', 'SampleB', 'Num_Differences'])

I'm a bit unclear. The link to the R question just appears to reshape the data, but you appear to be performing some reverse calculation to get 2 and 6 in the output from `1 4 8`. How do you know what resultants to use as there are potentially infinite subtraction operations which could produce distances of `1 4 8`. — Henry Ecker, Sep 12 '21 at 03:03
Apologies for the lack of clarity. I'm not actually wanting to do calculations, just to reshape the data from a 'molten'/long format into a matrix form. Perhaps I'm not using the correct terminology. I've also edited the question to fix up the numbers - they were left over from a more complex example I used originally - oops. — frustrated_bioinformatician, Sep 12 '21 at 03:06

mozway · Accepted Answer · 2021-09-12T05:10:27.880

You can reshape to square, and then make symmetrical by adding the transposed values:

# make unique, sorted, common index
idx = sorted(set(df['SampleA']).union(df['SampleB']))

# reshape
(df.pivot(index='SampleA', columns='SampleB', values='Num_Differences')
   .reindex(index=idx, columns=idx)
   .fillna(0, downcast='infer')
   .pipe(lambda x: x+x.values.T)
 )

Alternatively, you can use ordered categorical indexes and keep NAs during reshaping with pivot_table. Then add the transposed values to make symmetrical:

cat = sorted(set(df['SampleA']).union(df['SampleB']))

(df.assign(SampleA=pd.Categorical(df['SampleA'],
                                  categories=cat,
                                  ordered=True),
           SampleB=pd.Categorical(df['SampleB'],
                                  categories=cat,
                                  ordered=True),
           )
    .pivot_table(index='SampleA',
                 columns='SampleB',
                 values='Num_Differences',
                 dropna=False, fill_value=0)
    .pipe(lambda x: x+x.values.T)
)

Output:

SampleB   sample_1  sample_2  sample_3
SampleA                               
sample_1         0         1         4
sample_2         1         0         8
sample_3         4         8         0

Peter Leimbigler · Answer 2 · 2021-09-12T20:34:32.880

Precompute an array of unique labels in the original pairwise distances:

idx = pd.concat([df['SampleA'], df['SampleB']]).unique()
idx.sort() 
idx

array(['sample_1', 'sample_2', 'sample_3'], dtype=object)

Pivot, then reindex both the index and columns to introduce zero values in the resulting intermediate DataFrame:

res = (df.pivot('SampleA', 'SampleB', 'Num_Differences')
         .reindex(index=idx, columns=idx)
         .fillna(0)
         .astype(int))
res

SampleB   sample_1  sample_2  sample_3
SampleA                               
sample_1         0         1         4
sample_2         0         0         8
sample_3         0         0         0

Add the intermediate DataFrame to its own transpose to produce a symmetric pairwise distance matrix:

res += res.T
res

SampleB   sample_1  sample_2  sample_3
SampleA                               
sample_1         0         1         4
sample_2         1         0         8
sample_3         4         8         0

Thanks! Looks good. Is there a way to have the resulting matrix be symmetrical? (rather than zeroes in the bottom half) — frustrated_bioinformatician, Sep 12 '21 at 09:34
Good point; I've edited my answer to produce a symmetric distance matrix. Looks like I end up with something similar to @mozway's excellent answer! — Peter Leimbigler, Sep 12 '21 at 20:35

Henry Ecker · Answer 3 · 2021-09-12T03:42:29.050

We seem to be converting a weighted edgelist to an adjacency matrix. We can use networkx functions to make this conversion from_pandas_edgelist to adjacency_matrix:

import networkx as nx
import pandas as pd

# Create Graph
G = nx.from_pandas_edgelist(
    df,
    source='SampleA',
    target='SampleB',
    edge_attr='Num_Differences'
)

# Build adjacency matrix
adjacency_df = pd.DataFrame(
    nx.adjacency_matrix(G, weight='Num_Differences').todense(),
    index=G.nodes,
    columns=G.nodes
)

adjacency_df:

          sample_1  sample_2  sample_3
sample_1         0         1         4
sample_2         1         0         8
sample_3         4         8         0

We can also fill the diagonal with numpy.fill_diagonal if wanting NaN instead of 0s:

import networkx as nx
import numpy as np
import pandas as pd


G = nx.from_pandas_edgelist(
    df,
    source='SampleA',
    target='SampleB',
    edge_attr='Num_Differences'
)

adjacency_df = pd.DataFrame(
    nx.adjacency_matrix(G, weight='Num_Differences').todense(),
    index=G.nodes,
    columns=G.nodes,
    dtype=float  # Compatible dtype with NaN is needed
)
# Overwrite the values on the diagonal
np.fill_diagonal(adjacency_df.values, np.NaN)

adjacency_df:

          sample_1  sample_2  sample_3
sample_1       NaN       1.0       4.0
sample_2       1.0       NaN       8.0
sample_3       4.0       8.0       NaN

score 2 · Answer 4 · answered Sep 12 '21 at 03:20

2

 pd.pivot_table(df, values='Num_Differences', index='Sample_A',
                columns='SampleB', aggfunc=max, fill_value=0)

Note that if you don't have more than one instance of the same pair of Sample_A, Sample_B, it doesn't matter much what aggfunc you use; you can use sum, max, min, mode, mean, etc. If having more than one is possible, you might want to consider how you want Pandas to handle that.

answered Sep 12 '21 at 03:20

Acccumulation

3,491
1
8
12

Thanks for the answer. While this shows the differences between samples, I'd prefer to have sample_1, sample_2, and sample_3 as the column and row names. This method shows sample_1 & sample_2 as row names, and sample_2 & sample_3 as columns names. – frustrated_bioinformatician Sep 12 '21 at 09:30

Convert long-form dataframe of pairwise distances to distance matrix in python

4 Answers4