How to use sparse matrix in python hcluster?

Question

I'm trying to use the hcluster library in Python. I don't know enough Python to use a sparse matrix with hcluster. Can anybody help? Here is what I'm doing:

import os.path
import numpy
import scipy
import scipy.io 
from hcluster import squareform, pdist, linkage, complete 
from hcluster.hierarchy import linkage, from_mlab_linkage 
from numpy import savetxt 
from StringIO import StringIO

data.dmp contains only the upper-right part of the matrix. I don't know the correct English term for it :) I mean all the numbers above the main diagonal. So data.dmp contains: 1 0 1, 0 1 , 0

# Read the first line of the dump; the file is closed automatically.
with open('data.dmp', 'r') as f:
    s = f.readline()

matrix = numpy.asarray(eval("["+s+"]"))

For a reason I don't understand, hcluster uses inverted values; for example, I use 0 if A != C and 1 if A == D.

sqfrm = squareform(matrix)
Y = pdist(sqfrm, metric="cosine")

Then I run linkage on Y:

Z = linkage(Y, method="complete")

So the matrix Z is what I need (if I have used hcluster correctly?).

But I have the following problems:

I want to use a sparse matrix for the huge amount of input data, because generating the input data the way I do now is time-consuming. I need to import the data into Python from another language, which is why I need to read a text file. Python gurus, could you please suggest how to do this?
For people who have used Python hcluster: I need to process a huge amount of data, hundreds of rows. Is that possible in hcluster? Does this algorithm really produce a correct HAC?

Thank you for reading, I appreciate any help!

I can't imagine how this code can work, as written. For a start, `import scipy.io from hcluster` should be `from hcluster import scipy.io`. The first alternative is not syntactic. — hughdbrown, Dec 05 '10 at 23:37
Oh yes, you are right. I rewrote the import lines. I had formatted them Ruby-style at first :) — Daniel, Dec 06 '10 at 12:26

score 2 · Answer 1 · answered Jan 17 '11 at 19:27

Represent the inputs each as a dictionary, from feature name to value. Zeros are not present in the dictionary.
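As a small illustration of that representation (the helper name `to_sparse_dict` and the sample rows are made up for this sketch, not part of hcluster), a dense row can be turned into a dictionary that keeps only its nonzero entries:

```python
def to_sparse_dict(row, feature_names=None):
    """Return {feature: value} for the nonzero entries of a dense row.

    If no feature names are given, the column index is used as the name.
    """
    if feature_names is None:
        feature_names = range(len(row))
    return {name: value for name, value in zip(feature_names, row) if value != 0}


# Two hypothetical dense input rows.
rows = [
    [0, 1, 0, 1],
    [1, 0, 0, 1],
]
sparse_rows = [to_sparse_dict(r) for r in rows]
```

If the data comes from another program, that program could just as well write only the nonzero (feature, value) pairs, so the dense row never has to exist.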

Compute the Y matrix yourself rather than using hcluster.pdist. The following code computes the squared error between two sparse vectors. If you L2-normalize all feature vectors, the squared error is exactly twice the cosine distance, so the two differ only by a constant factor.

def sqrerr(repr1, repr2):
    """
    Compute the squared error between two reprs.
    The reprs are each a dict from feature to feature value.
    """
    # Union of the features present in either dict; a missing feature counts as 0.
    keys = set(repr1) | set(repr2)
    total = 0.
    for k in keys:
        diff = repr1.get(k, 0.) - repr2.get(k, 0.)
        total += diff * diff
    return total
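To make the link to cosine distance concrete: for unit-length vectors a and b, ||a - b||^2 = 2 - 2(a·b) = 2(1 - cos(a, b)). The sketch below checks that numerically on two made-up vectors. The helpers `l2_normalize`, `sparse_sqrerr` and `cosine_distance` are illustrative names of my own, not hcluster functions.

```python
import math


def l2_normalize(rep):
    """Scale a sparse dict vector to unit Euclidean length.

    An all-zero vector has no direction, so it is returned unchanged.
    """
    norm = math.sqrt(sum(v * v for v in rep.values()))
    if norm == 0:
        return dict(rep)
    return {k: v / norm for k, v in rep.items()}


def sparse_sqrerr(a, b):
    """Squared Euclidean distance between two sparse dict vectors."""
    return sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in set(a) | set(b))


def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / (norm_a * norm_b)


# Two hypothetical feature vectors.
a = {"x": 3.0, "y": 4.0}
b = {"y": 1.0, "z": 2.0}

lhs = sparse_sqrerr(l2_normalize(a), l2_normalize(b))
rhs = 2.0 * cosine_distance(a, b)
```

Here `lhs` and `rhs` agree up to floating-point rounding, so clustering on squared error of normalized vectors only rescales the cosine distances by 2.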

You should call sqrerr for every Y[i,j] element you want to compute.

Make Y a square matrix, and make sure that Y[i,j] == Y[j,i]. Then use hcluster.squareform to convert Y to the condensed form that hcluster.linkage expects.
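A possible variation on the steps above, as a rough sketch rather than the only way to do it: since the distance is symmetric, you can compute only the pairs with i < j and emit them directly in upper-triangle, row-by-row order. That is the condensed layout SciPy's `squareform` produces from a square matrix; I am assuming hcluster's `squareform` uses the same order. The function name `condensed_distances` and the sample data are made up for illustration.

```python
from itertools import combinations


def sqrerr(repr1, repr2):
    """Squared error between two sparse dict vectors (missing features count as 0)."""
    keys = set(repr1) | set(repr2)
    return sum((repr1.get(k, 0.0) - repr2.get(k, 0.0)) ** 2 for k in keys)


def condensed_distances(reprs):
    """Pairwise distances for all i < j, in order (0,1), (0,2), ..., (n-2,n-1)."""
    return [
        sqrerr(reprs[i], reprs[j])
        for i, j in combinations(range(len(reprs)), 2)
    ]


# Three hypothetical inputs as sparse dicts.
reprs = [{"a": 1.0}, {"a": 1.0, "b": 1.0}, {"c": 2.0}]
Y = condensed_distances(reprs)  # n * (n - 1) / 2 entries
```

The resulting list can be converted to a NumPy array and passed to `linkage(Y, method="complete")`. This skips building the full square matrix, which matters once the number of rows grows, because only n(n-1)/2 values are stored instead of n^2.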
