8

I have a pandas dataframe of pairwise distances in the form of:

    SampleA   SampleB  Num_Differences
0  sample_1  sample_2                1
1  sample_1  sample_3                4
2  sample_2  sample_3                8

Note that there are no self-self comparisons (e.g., sample_1 vs sample_1 won't be represented). I would like to convert this table into a squareform distance matrix instead, like so:

            sample_1      sample_2  sample_3
sample_1                       1              4
sample_2         1                            8
sample_3         4             8    

Can anyone give me some pointers on how to do such a conversion in python? The problem is analogous to a previous question in R (Converting pairwise distances into a distance matrix in R), but I don't know the corresponding python functions to use. The problem also appears to be the opposite of this question (Convert a distance matrix to a list of pairwise distances in Python).

Some code to reproduce a dataframe in the form I'm using:

df = pd.DataFrame([['sample_1', 'sample_2', 1],
                   ['sample_1', 'sample_3', 4],
                   ['sample_2', 'sample_3', 8]],
                  columns=['SampleA', 'SampleB', 'Num_Differences'])
Sunderam Dubey
  • 1
  • 11
  • 20
  • 40
  • I'm a bit unclear. The link to the R question just appears to reshape the data, but you appear to be performing some reverse calculation to get 2 and 6 in the output from `1 4 8`. How do you know what resultants to use as there are potentially infinite subtraction operations which could produce distances of `1 4 8`. – Henry Ecker Sep 12 '21 at 03:03
  • Apologies for the lack of clarity. I'm not actually wanting to do calculations, just to reshape the data from a 'molten'/long format into a matrix form. Perhaps I'm not using the correct terminology. I've also edited the question to fix up the numbers - they were left over from a more complex example I used originally - oops. – frustrated_bioinformatician Sep 12 '21 at 03:06

4 Answers4

4

You can reshape to square, and then make symmetrical by adding the transposed values:

# make unique, sorted, common index
idx = sorted(set(df['SampleA']).union(df['SampleB']))

# reshape
(df.pivot(index='SampleA', columns='SampleB', values='Num_Differences')
   .reindex(index=idx, columns=idx)
   .fillna(0, downcast='infer')
   .pipe(lambda x: x+x.values.T)
 )

Alternatively, you can use ordered categorical indexes and keep NAs during reshaping with pivot_table. Then add the transposed values to make symmetrical:

cat = sorted(set(df['SampleA']).union(df['SampleB']))

(df.assign(SampleA=pd.Categorical(df['SampleA'],
                                  categories=cat,
                                  ordered=True),
           SampleB=pd.Categorical(df['SampleB'],
                                  categories=cat,
                                  ordered=True),
           )
    .pivot_table(index='SampleA',
                 columns='SampleB',
                 values='Num_Differences',
                 dropna=False, fill_value=0)
    .pipe(lambda x: x+x.values.T)
)

Output:

SampleB   sample_1  sample_2  sample_3
SampleA                               
sample_1         0         1         4
sample_2         1         0         8
sample_3         4         8         0
mozway
  • 194,879
  • 13
  • 39
  • 75
3
  1. Precompute an array of unique labels in the original pairwise distances:
idx = pd.concat([df['SampleA'], df['SampleB']]).unique()
idx.sort() 
idx

array(['sample_1', 'sample_2', 'sample_3'], dtype=object)
  1. Pivot, then reindex both the index and columns to introduce zero values in the resulting intermediate DataFrame:
res = (df.pivot('SampleA', 'SampleB', 'Num_Differences')
         .reindex(index=idx, columns=idx)
         .fillna(0)
         .astype(int))
res

SampleB   sample_1  sample_2  sample_3
SampleA                               
sample_1         0         1         4
sample_2         0         0         8
sample_3         0         0         0
  1. Add the intermediate DataFrame to its own transpose to produce a symmetric pairwise distance matrix:
res += res.T
res

SampleB   sample_1  sample_2  sample_3
SampleA                               
sample_1         0         1         4
sample_2         1         0         8
sample_3         4         8         0
Peter Leimbigler
  • 10,775
  • 1
  • 23
  • 37
3

We seem to be converting a weighted edgelist to an adjacency matrix. We can use networkx functions to make this conversion from_pandas_edgelist to adjacency_matrix:

import networkx as nx
import pandas as pd

# Create Graph
G = nx.from_pandas_edgelist(
    df,
    source='SampleA',
    target='SampleB',
    edge_attr='Num_Differences'
)

# Build adjacency matrix
adjacency_df = pd.DataFrame(
    nx.adjacency_matrix(G, weight='Num_Differences').todense(),
    index=G.nodes,
    columns=G.nodes
)

adjacency_df:

          sample_1  sample_2  sample_3
sample_1         0         1         4
sample_2         1         0         8
sample_3         4         8         0

We can also fill the diagonal with numpy.fill_diagonal if wanting NaN instead of 0s:

import networkx as nx
import numpy as np
import pandas as pd


G = nx.from_pandas_edgelist(
    df,
    source='SampleA',
    target='SampleB',
    edge_attr='Num_Differences'
)

adjacency_df = pd.DataFrame(
    nx.adjacency_matrix(G, weight='Num_Differences').todense(),
    index=G.nodes,
    columns=G.nodes,
    dtype=float  # Compatible dtype with NaN is needed
)
# Overwrite the values on the diagonal
np.fill_diagonal(adjacency_df.values, np.NaN)

adjacency_df:

          sample_1  sample_2  sample_3
sample_1       NaN       1.0       4.0
sample_2       1.0       NaN       8.0
sample_3       4.0       8.0       NaN
Henry Ecker
  • 34,399
  • 18
  • 41
  • 57
2
 pd.pivot_table(df, values='Num_Differences', index='Sample_A',
                columns='SampleB', aggfunc=max, fill_value=0)

Note that if you don't have more than one instance of the same pair of Sample_A, Sample_B, it doesn't matter much what aggfunc you use; you can use sum, max, min, mode, mean, etc. If having more than one is possible, you might want to consider how you want Pandas to handle that.

Acccumulation
  • 3,491
  • 1
  • 8
  • 12
  • Thanks for the answer. While this shows the differences between samples, I'd prefer to have sample_1, sample_2, and sample_3 as the column and row names. This method shows sample_1 & sample_2 as row names, and sample_2 & sample_3 as columns names. – frustrated_bioinformatician Sep 12 '21 at 09:30