
I have a pairwise distance dataframe that I've made with pandas:

# Collect the PDB files
import glob
import itertools
import pandas

one_dimension = glob.glob('*.pdb')

# One row per unordered pair of files; get_rmsd is my own function (not shown)
dataframe = []
for pdb_1, pdb_2 in itertools.combinations(one_dimension, 2):
    entry = {'pdb_1': pdb_1, 'pdb_2': pdb_2, 'rmsd': get_rmsd(pdb_1, pdb_2)}
    dataframe.append(entry)

dataframe = pandas.DataFrame(dataframe)
dataframe


All I want to do is cluster the dataframe so that every cluster contains only pdbs whose pairwise rmsd is below some cutoff (let's say less than 2). I have read that complete linkage is the way to go.

For instance:

  1. pdb_1, pdb_2 have an rmsd of 1.56
  2. pdb_3, pdb_2 have an rmsd of 1.03
  3. pdb_3, pdb_1 have an rmsd of 1.60

So they can all appear in a cluster together. But if a new pdb is about to be added to the cluster and its rmsd to any member already in the cluster is > 2, it will be rejected.

I understand that this is a complete linkage with a cutoff.

I have looked into scipy.cluster.hierarchy.linkage, but I'm having an extremely hard time formatting the array to enter into the linkage.

  • What is the best way to complete this task?

  • How do I go from my dataframe to something that can be used by scipy.cluster?

  • Should I turn it into an R dataframe?

  • How do I find out which members are in each cluster if I transform the pairwise distances into an array?

I have found this, this, and this question similar, and found this tutorial

UPDATE

According to the answer by cel, I can get the following:

>>> df


and then pivot

pivot_table = df.pivot(index='pdb_1', columns='pdb_2', values='rmsd').fillna(0)
>>> pivot_table


Then the data array

import numpy as np
piv_arr = pivot_table.to_numpy()  # as_matrix() was removed in pandas 1.0
dist_mat = piv_arr + np.transpose(piv_arr)
>>> dist_mat


But I can't make a squareform, as the diagonals don't equal 0...

>>> squareform(dist_mat)


and can verify

>>> dist_mat.diagonal()


  • The input to the different hierarchical clustering methods is a condensed distance matrix. To create such a distance matrix from your observations, you may want to have a look at `pdist` (http://docs.scipy.org/doc/scipy-0.15.1/reference/generated/scipy.spatial.distance.pdist.html) – cel Jun 27 '15 at 07:44
  • my input is precomputed distances. I can't create distances of distances – jwillis0720 Jun 27 '15 at 08:27
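
On the point raised in these two comments: distances that are already computed do not need `pdist`. SciPy's condensed format lists the pairs in the order (0,1), (0,2), ..., (1,2), ..., which is the same order `itertools.combinations` yields, so a list of pair distances built that way can be handed to `linkage` directly. A minimal sketch, assuming four hypothetical names and a made-up `fake_rmsd` lookup standing in for `get_rmsd`:

```python
import itertools

import numpy as np
import scipy.cluster.hierarchy as hcl
from scipy.spatial.distance import squareform

# Hypothetical names and made-up distances, for illustration only
names = ["p1", "p2", "p3", "p4"]
fake_rmsd = {
    ("p1", "p2"): 1.5, ("p1", "p3"): 1.0, ("p1", "p4"): 4.0,
    ("p2", "p3"): 1.2, ("p2", "p4"): 3.5, ("p3", "p4"): 3.8,
}

# Pair distances in itertools.combinations order = condensed distance matrix
condensed = np.array([fake_rmsd[pair]
                      for pair in itertools.combinations(names, 2)])

# squareform expands the condensed vector to the full symmetric matrix,
# which makes it easy to check that each pair landed where expected
square = squareform(condensed)

# linkage accepts the condensed vector as-is
Z = hcl.linkage(condensed, method="complete")
```

This only holds if the order of `names` is the same order that was used to generate the pairs.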

1 Answer


This might work for you:

These are the imports we need:

import scipy.cluster.hierarchy as hcl
from scipy.spatial.distance import squareform
import pandas as pd
import numpy as np

Let's assume we already calculated the distance matrix and decided to store the upper triangular part of the distance matrix in this format:

data = pd.DataFrame({
    "a": ["a1", "a1", "a2", "a3", "a2", "a1"],
    "b": ["a2", "a3", "a3", "a3", "a2", "a1"],
    "distance": [1,2,3, 0, 0, 0]
})

So this is our data frame:

    a   b  distance
0  a1  a2         1
1  a1  a3         2
2  a2  a3         3
3  a3  a3         0
4  a2  a2         0
5  a1  a1         0

Using DataFrame.pivot, we can convert the data frame to a square distance matrix:

data_piv = data.pivot(index="a", columns="b", values="distance").fillna(0)
piv_arr = data_piv.to_numpy()  # as_matrix() was removed in pandas 1.0
dist_mat = piv_arr + np.transpose(piv_arr)

This will give us:

array([[ 0.,  1.,  2.],
       [ 1.,  0.,  3.],
       [ 2.,  3.,  0.]])

This we can transform into a condensed distance matrix via squareform and feed into the linkage algorithm:

hcl.linkage(squareform(dist_mat))

Which gives us the following linkage matrix:

array([[ 0.,  1.,  1.,  2.],
       [ 2.,  3.,  2.,  3.]])
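
Note that `linkage` defaults to `method='single'`, whereas the question asks for complete linkage with a cutoff. A minimal sketch of that step, assuming the same toy distance matrix as above and an illustrative cutoff of 2.5 (the cutoff and the names `names`, `Z`, `labels` and `clusters` are assumptions, not part of the original answer). With complete linkage, cutting the tree by distance means no two members of a flat cluster are farther apart than the cutoff:

```python
import numpy as np
import scipy.cluster.hierarchy as hcl
from scipy.spatial.distance import squareform

# Row/column labels, in the same order as the rows of the distance matrix
names = ["a1", "a2", "a3"]
dist_mat = np.array([[0., 1., 2.],
                     [1., 0., 3.],
                     [2., 3., 0.]])

# Complete linkage: the distance between two clusters is the largest
# pairwise distance between their members
Z = hcl.linkage(squareform(dist_mat), method="complete")

# Cut the tree at the cutoff; labels[i] is the flat cluster id of names[i]
cutoff = 2.5  # illustrative value
labels = hcl.fcluster(Z, t=cutoff, criterion="distance")

# Group the original names by cluster id
clusters = {}
for name, label in zip(names, labels):
    clusters.setdefault(int(label), []).append(name)
print(clusters)
```

In the pivot-based workflow, `names` would come from `data_piv.index`, which is what ties the cluster ids back to the original data frame.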
  • This looks great! Will test first thing in the morning – jwillis0720 Jun 27 '15 at 09:20
  • Can you help me interpret the array at the end? I understand that the first and second numbers are the nodes, the third number is the distance between the nodes, and the last number is how many members that cluster contains. But how does that relate to the original data frame? – jwillis0720 Jun 28 '15 at 01:12
  • @jwillis0720, The column and row headers of `data_piv` link the cluster id's 1 to n to the corresponding names specified by your original data frame. – cel Jun 28 '15 at 06:05
  • almost! having trouble getting the squareform (see update) – jwillis0720 Jun 28 '15 at 06:42
  • @jwillis0720, hmh, you may want to find the diagonal index that is non-zero and see what the problem is. I can only guess. Note that you have to specify the full upper triangular matrix or else the pivoting may not give you the correct distance matrix. – cel Jun 28 '15 at 06:45
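
Following up on the last comment, one likely cause (an assumption, since the screenshots are not available) is that a frame built from `itertools.combinations` alone never has the first name in the second column or the last name in the first column. The pivot's rows and columns then carry different labels, and adding the transpose mixes unrelated entries. A minimal sketch of one way around this, using hypothetical names `p1` to `p3` and made-up distances: reindex the pivot on the union of both label columns before symmetrizing.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import squareform

# Hypothetical pairs-only frame (no self-distance rows), made-up values
df = pd.DataFrame({
    "pdb_1": ["p1", "p1", "p2"],
    "pdb_2": ["p2", "p3", "p3"],
    "rmsd": [1.56, 1.60, 1.03],
})

# Every name that appears in either column, in one fixed order
labels = sorted(set(df["pdb_1"]) | set(df["pdb_2"]))

# Force the same labels onto rows and columns so the matrix is square
# and aligned; missing cells (lower triangle, diagonal) become 0
piv = (df.pivot(index="pdb_1", columns="pdb_2", values="rmsd")
         .reindex(index=labels, columns=labels)
         .fillna(0))

# Upper triangle + its transpose = full symmetric matrix, zero diagonal
dist_mat = piv.to_numpy() + piv.to_numpy().T

# squareform now accepts the matrix and returns the condensed form
condensed = squareform(dist_mat)
```

Here `labels` also gives the name for each row of `dist_mat`, so it can be used to map cluster ids back to file names.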