I'm trying to use the hcluster library in Python. I don't know Python well enough to use a sparse matrix with hcluster. Can anybody please help me? Here is what I'm doing:
import os.path
import numpy
import scipy
import scipy.io
from hcluster import squareform, pdist, linkage, complete
from hcluster.hierarchy import linkage, from_mlab_linkage
from numpy import savetxt
from StringIO import StringIO
data.dmp contains a matrix that looks like this:
A B C D
A 0 1 0 1
B 1 0 0 1
C 0 0 0 0
D 1 1 0 0
and contains only the upper-right part of the matrix (the upper triangle), that is, all the numbers above the main diagonal. So data.dmp contains: 1 0 1, 0 1 , 0
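To make the layout concrete, here is a minimal sketch in plain Python of how the upper-triangle values map back onto the full symmetric matrix. The helper name `expand_upper_triangle` is made up for illustration; it assumes the values are listed row by row and that the diagonal is zero, as in the example above.

```python
def expand_upper_triangle(values, n):
    """Build a full symmetric n x n matrix from the values above the
    diagonal, listed row by row; the diagonal is filled with zeros."""
    expected = n * (n - 1) // 2
    if len(values) != expected:
        raise ValueError("expected %d values, got %d" % (expected, len(values)))
    full = [[0] * n for _ in range(n)]
    k = 0
    for i in range(n):
        for j in range(i + 1, n):
            # mirror each value across the main diagonal
            full[i][j] = full[j][i] = values[k]
            k += 1
    return full

condensed = [1, 0, 1, 0, 1, 0]  # the contents of data.dmp described above
full = expand_upper_triangle(condensed, 4)
for row in full:
    print(row)
```

This is the same expansion that squareform performs on a condensed vector.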
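# data.dmp holds the upper-triangle values on a single line
with open('data.dmp', 'r') as f:
    s = f.readline()
# NOTE: eval runs whatever is in the file; only use it on trusted input
matrix = numpy.asarray(eval("[" + s + "]"))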
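Since eval executes whatever text the file contains, a safer way to read the line might look like the sketch below. It assumes the values are separated by commas and/or whitespace; `parse_condensed_line` is a hypothetical helper, not part of hcluster.

```python
def parse_condensed_line(line):
    """Parse numbers separated by commas and/or whitespace into floats."""
    tokens = line.replace(",", " ").split()
    return [float(tok) for tok in tokens]

# the same text as the data.dmp example, with its irregular spacing
values = parse_condensed_line("1 0 1, 0 1 , 0\n")
print(values)
```

The resulting list can be passed to numpy.asarray in the same way as the eval result.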
For a reason unknown to me, hcluster uses inverted values: for example, I use 0 if A != C and 1 if A == D.
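The inversion is probably because hierarchical clustering works on distances, where 0 means "identical", while the values in my file are similarities, where 1 means "same". A minimal sketch of the conversion, assuming the similarities lie in the range 0 to 1 (the function name is made up for illustration):

```python
def similarity_to_distance(similarities):
    """Map similarities in [0, 1] to distances:
    1 (same) becomes 0, and 0 (different) becomes 1."""
    return [1 - s for s in similarities]

distances = similarity_to_distance([1, 0, 1, 0, 1, 0])
print(distances)
```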
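# expand the upper-triangle vector into a full square matrix
sqfrm = squareform(matrix)
# pairwise cosine distances between the rows of the square matrix
Y = pdist(sqfrm, metric="cosine")
# complete-linkage clustering of Y
Z = linkage(Y, method="complete")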
So, the matrix Z is what I need (if I used hcluster correctly?).
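For comparison, here is a sketch of the same step with SciPy, whose scipy.cluster.hierarchy and scipy.spatial.distance modules provide linkage and squareform under the same names. It assumes the values in the file are already pairwise similarities, so they are turned into distances and passed to linkage directly as a condensed vector. By contrast, pdist would treat each row of the square matrix as an observation vector. Whether that is the intended behaviour depends on what the values mean, so take this as one possible reading rather than the correct one.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# upper-triangle similarities from the data.dmp example
similarities = np.array([1, 0, 1, 0, 1, 0], dtype=float)
# assumption: distance = 1 - similarity
distances = 1.0 - similarities

# linkage accepts the condensed (upper-triangle) distance vector directly
Z = linkage(distances, method="complete")

# squareform is only needed to look at the full 4 x 4 distance matrix
full = squareform(distances)
print(Z.shape)
print(full)
```

Z has one row per merge, so for 4 items it has 3 rows and 4 columns.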
But I have the following problems:
I want to use a sparse matrix for the huge amount of input data, because generating the input data the way I do now is time-consuming. I need to import the data into Python from another language, which is why I need to read a text file. Python gurus, could you please suggest how to do this?
To people who have used Python hcluster: I need to process a huge amount of data, hundreds of rows. Is that possible in hcluster? Does this algorithm really produce a correct HAC?
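To illustrate the sparse input I have in mind: the other program could write only the non-zero pairs, one "i j value" triple per line, and Python would fill in the rest. The sketch below is written for Python 3 and uses only the standard library. The file format and both helper names are hypothetical. As far as I understand, linkage still needs the full condensed vector, so this would only shrink the text file, not the memory used for clustering.

```python
import io

def condensed_index(i, j, n):
    """Position of the pair (i, j), with i < j, in a row-by-row
    upper-triangle vector for n items."""
    return n * i - i * (i + 1) // 2 + (j - i - 1)

def read_sparse_pairs(lines, n, default=0.0):
    """Read 'i j value' lines (hypothetical format) into a condensed
    vector. Pairs that are not listed keep the default value."""
    condensed = [default] * (n * (n - 1) // 2)
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        i, j, value = int(parts[0]), int(parts[1]), float(parts[2])
        if i == j:
            continue  # the diagonal is not stored
        if i > j:
            i, j = j, i  # the matrix is symmetric
        condensed[condensed_index(i, j, n)] = value
    return condensed

# stand-in for a text file listing only the non-zero entries of the example
sample = io.StringIO("0 1 1\n0 3 1\n1 3 1\n")
condensed = read_sparse_pairs(sample, n=4)
print(condensed)
```

With a real file, the io.StringIO object would be replaced by an open file handle.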
Thank you for reading, I appreciate any help!