I have a pairwise distance dataframe that I've made with pandas:
# Get files
import glob
import itertools

import pandas

one_dimension = glob.glob('*.pdb')
dataframe = []
for combo in itertools.combinations(one_dimension, 2):
    pdb_1 = combo[0]
    pdb_2 = combo[1]
    # get_rmsd (not shown) returns the RMSD between the two structures
    entry = {'pdb_1': pdb_1, 'pdb_2': pdb_2, 'rmsd': get_rmsd(pdb_1, pdb_2)}
    dataframe.append(entry)
dataframe = pandas.DataFrame(dataframe)
dataframe
All I want to do is cluster the dataframe so that every cluster contains only pdbs whose pairwise rmsd is below some cutoff (let's say less than 2). I have read that complete linkage is the way to go.
For instance:
- pdb_1,pdb_2 have an rmsd 1.56
- pdb_3,pdb_2 have an rmsd 1.03
- pdb_3, pdb_1 have an rmsd of 1.60
So they can all appear in a cluster together. But if a new pdb is a candidate for the cluster and its rmsd to any member already in the cluster is > 2, it will be rejected.
I understand that this is a complete linkage with a cutoff.
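To make the idea concrete, here is a minimal sketch of complete linkage with a distance cutoff using scipy. The RMSD values are hypothetical, modelled on the example above, and pdb_4 is made up for illustration: it is within 2 of pdb_3 but further than 2 from the others. With complete linkage, the height at which two groups merge is the largest pairwise distance between their members, so cutting the tree at 2 should leave only clusters whose members are all within 2 of each other (scipy's cut is "less than or equal", not strictly less).

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Hypothetical symmetric RMSD matrix; pdb_4 is an invented extra structure.
labels = ["pdb_1", "pdb_2", "pdb_3", "pdb_4"]
dist = np.array([
    [0.00, 1.56, 1.60, 2.40],
    [1.56, 0.00, 1.03, 2.50],
    [1.60, 1.03, 0.00, 1.90],
    [2.40, 2.50, 1.90, 0.00],
])

# linkage expects the condensed (upper-triangle) form of the matrix.
condensed = squareform(dist)
Z = linkage(condensed, method="complete")

# Cut the tree at the cutoff: merges above 2.0 are not applied.
cluster_ids = fcluster(Z, t=2.0, criterion="distance")
print(dict(zip(labels, cluster_ids)))
```

Here pdb_4 ends up on its own even though it is close to pdb_3, because it is too far from pdb_1 and pdb_2. Note that hierarchical clustering is greedy, so the clusters satisfy the cutoff but are not guaranteed to be the largest possible groups.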
I have looked into scipy.cluster.hierarchy.linkage, but I'm having an extremely hard time formatting the array to enter into the linkage.
What is the best way to complete this task?
How do I go from my dataframe to something that can be used by scipy.cluster? Should I turn it into an R dataframe?
How do I find out which members are in each cluster once I have transformed the pairwise distances into an array?
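One possible route, sketched under some assumptions: itertools.combinations yields pairs in the same order that scipy uses for its condensed distance vector, so if the dataframe rows are still in the order they were generated, the rmsd column can be passed to linkage directly. The file names and the fake_rmsd lookup below are hypothetical stand-ins for the glob result and for get_rmsd. The labels returned by fcluster are in the same order as the original list of names, which gives the cluster membership.

```python
import itertools

import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage

# Hypothetical stand-ins for glob.glob('*.pdb') and get_rmsd().
names = ["a.pdb", "b.pdb", "c.pdb", "d.pdb"]
fake_rmsd = {
    ("a.pdb", "b.pdb"): 1.56, ("a.pdb", "c.pdb"): 1.60, ("a.pdb", "d.pdb"): 2.40,
    ("b.pdb", "c.pdb"): 1.03, ("b.pdb", "d.pdb"): 2.50, ("c.pdb", "d.pdb"): 1.90,
}
df = pd.DataFrame([
    {"pdb_1": p, "pdb_2": q, "rmsd": fake_rmsd[(p, q)]}
    for p, q in itertools.combinations(names, 2)
])

# Assumption: rows are still in itertools.combinations order, which matches
# scipy's condensed ordering (0,1), (0,2), ..., (1,2), ...
condensed = df["rmsd"].to_numpy()
Z = linkage(condensed, method="complete")
cluster_ids = fcluster(Z, t=2.0, criterion="distance")

# cluster_ids[i] is the cluster of names[i]; group the names by cluster id.
clusters = {}
for name, cid in zip(names, cluster_ids):
    clusters.setdefault(int(cid), []).append(name)
print(clusters)
```

If the dataframe has been sorted, filtered, or built in some other order, this shortcut no longer holds and the distances would need to be placed into a square matrix by name first.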
I have found this, this, and this question, which are similar, and I also found this tutorial.
UPDATE
According to the answer by cel, I can get the following:
>>df
and then pivot
pivot_table = df.pivot(index='pdb_1', columns='pdb_2', values='rmsd').fillna(0)
>>pivot_table
Then the data array
import numpy as np
piv_arr = pivot_table.to_numpy()
dist_mat = piv_arr + np.transpose(piv_arr)
>>dist_mat
But I can't convert it with squareform, because the diagonal entries don't equal 0...
>>>squareform(dist_mat)
and I can verify this:
>>dist_mat.diagonal()
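One likely cause, shown here as a sketch rather than a confirmed diagnosis: pivot labels the rows with the names that occur in pdb_1 and the columns with the names that occur in pdb_2, and those are different sets (the first file never appears in pdb_2, the last never appears in pdb_1). Position (i, i) of the array then holds the distance between two different files. Reindexing both axes to the full list of names before adding the transpose should give a symmetric matrix with a zero diagonal. The names and values below are hypothetical.

```python
import itertools

import numpy as np
import pandas as pd
from scipy.spatial.distance import squareform

# Hypothetical file names and RMSD values, in itertools.combinations order.
names = ["a.pdb", "b.pdb", "c.pdb"]
values = [1.56, 1.60, 1.03]
pairs = list(itertools.combinations(names, 2))
df = pd.DataFrame({
    "pdb_1": [p for p, _ in pairs],
    "pdb_2": [q for _, q in pairs],
    "rmsd": values,
})

# Rows are a, b and columns are b, c: the labels do not line up.
pivot_table = df.pivot(index="pdb_1", columns="pdb_2", values="rmsd")

# Put every name on both axes, then fill the missing entries with 0.
square = pivot_table.reindex(index=names, columns=names).fillna(0.0)

# Each pair is stored once, so adding the transpose mirrors it across the diagonal.
half = square.to_numpy()
dist_mat = half + half.T

print(dist_mat.diagonal())
condensed = squareform(dist_mat)  # passes the symmetry and zero-diagonal checks
```

The resulting condensed vector can then be handed to scipy.cluster.hierarchy.linkage, with names giving the label for each row of the matrix.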