I want to use the dendrogram function from SciPy. I have the following data:

I have a list with seven different means. For example:

Y = [71.407452200146807, 0, 33.700136456196823, 1112.3757110973756, 31.594949722819372, 34.823881975554166, 28.36368420190157]

Each mean is calculated for a different user. For example:

X = ["user1", "user2", "user3", "user4", "user5", "user6", "user7"]

My aim is to display the data described above with the help of a dendrogram.

I tried the following:

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

Y = [71.407452200146807, 0, 33.700136456196823, 1112.3757110973756, 31.594949722819372, 34.823881975554166, 28.36368420190157]
X = ["user1", "user2", "user3", "user4", "user5", "user6", "user7"]

# Attempt with matrix
#X = np.concatenate((X, Y),)
#Z = linkage(X)

Z = linkage(Y)
# Plot the dendrogram with the results above
dendrogram(Z, leaf_rotation=45., leaf_font_size=12., show_contracted=True)
plt.style.use("seaborn-whitegrid")
plt.title("Dendrogram to find clusters")
plt.ylabel("Distance")
plt.show()

But it says:

ValueError: Length n of condensed distance matrix 'y' must be a binomial coefficient, i.e.there must be a k such that (k \choose 2)=n)!

I already tried to convert my data into a matrix. With:

# Attempt with matrix
#X = np.concatenate((X, Y),)
#Z = linkage(X)

But that doesn't work either!

Are there any suggestions?

Thanks :-)

Jannik
  • what I have understood by reading the documentation so far is `y must be a (n|2) sized vector`. I tried your code and if `len(Y)` is 15 it works. Trying to figure out why it won't work for anything less than that. – Vikash Singh Jan 28 '18 at 12:15
  • Sounds weird. But this is actually a use case, isn't it? It should be possible to build clusters of users which have the "same" mean. – Jannik Jan 28 '18 at 12:26

2 Answers

11

The first argument of linkage is either an n x m array, representing n points in m-dimensional space, or a one-dimensional array containing the condensed distance matrix. These are two very different meanings! The first is the raw data, i.e. the observations. The second format assumes that you have already computed all the distances between your observations, and you are providing these distances to linkage, not the original points.
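As a rough illustration of the two input formats, the sketch below builds both from the same data. The four points are made-up values, and using `scipy.spatial.distance.pdist` to produce the condensed distance matrix is an assumption of this example, not something the question uses.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

# Hypothetical data: 4 observations in 1-dimensional space, shape (4, 1).
points = np.array([[0.0], [1.0], [3.0], [7.0]])

# Condensed distance matrix: one entry per pair of points, so 4 * 3 / 2 = 6 values.
condensed = pdist(points)

# Format 1: raw observations (2-D array).
Z_from_points = linkage(points)

# Format 2: precomputed pairwise distances (1-D condensed array).
Z_from_distances = linkage(condensed)

print(condensed.shape)                               # (6,)
print(np.allclose(Z_from_points, Z_from_distances))  # True
```

Both calls describe the same clustering; only the interpretation of the first argument differs.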

It looks like you want the first case (raw data), with m = 1. So you must reshape the input to have shape (n, 1).

Replace this:

Z = linkage(Y)

with:

Z = linkage(np.reshape(Y, (len(Y), 1)))
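A minimal sketch of that fix applied to the data from the question follows. Passing `labels=X` and `no_plot=True` to `dendrogram` is an addition made here for illustration; `no_plot=True` only computes the tree layout, so remove it to draw the figure with matplotlib.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

Y = [71.407452200146807, 0, 33.700136456196823, 1112.3757110973756,
     31.594949722819372, 34.823881975554166, 28.36368420190157]
X = ["user1", "user2", "user3", "user4", "user5", "user6", "user7"]

# Reshape to (n, 1): n observations, each a point in 1-dimensional space.
observations = np.reshape(Y, (len(Y), 1))

# With a 2-D input, linkage treats the rows as raw observations
# (defaults: single linkage, Euclidean distance).
Z = linkage(observations)

# Compute the dendrogram layout without drawing it; labels=X names the leaves.
tree = dendrogram(Z, labels=X, no_plot=True)

print(Z.shape)      # (6, 4): one row per merge, n - 1 merges in total
print(tree["ivl"])  # leaf labels in the order they would be drawn
```

With these values, user4 (mean of about 1112) is far from all other users and is the last to be merged.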
Warren Weckesser
1

So you are using 7 observations in Y, i.e. len(Y) = 7.

But as per the documentation of linkage, the number of observations len(Y) should be such that

{n \choose 2} = len(Y)

which means

1/2 * (n - 1) * n = len(Y)

so the length of Y should be such that n is a valid integer.

Vikash Singh
  • You are right. I extended my Y and it works. Unfortunately, the dendrogram looks weird. There are only single clusters. What is the right way to extend my Y? I tried to extend it with the same values again, like `Y.extend(Y)` – Jannik Jan 28 '18 at 12:45
  • Instead of calculating the mean I used the original values. Thus, my Y is bigger. This worked for me :-) Thanks a lot – Jannik Jan 28 '18 at 14:51
  • The meaning of the first argument of `linkage` depends on the number of dimensions of the argument. When it is one-dimensional, the values are interpreted as the pairwise distances between the points, stored in the condensed arrangement, *not* as the points (i.e. observations) themselves. It is not correct to say that the number of *observations* should be `{n \choose 2}`. – Warren Weckesser Jan 28 '18 at 18:34
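
To make the point in the last comment concrete, here is a small sketch of the length check implied by the error message. The helper name `points_for_condensed_length` is hypothetical and not part of SciPy. A one-dimensional input of length n is only a valid condensed distance matrix if n = k(k - 1)/2 for some integer k, which is why a list of 15 values is accepted (it is read as the distances between 6 points) while a list of 7 values is rejected.

```python
import math


def points_for_condensed_length(n):
    """Return k such that k * (k - 1) / 2 == n, or None if no such integer exists.

    Hypothetical helper: k is the number of points whose pairwise distances
    would fill a condensed distance matrix of length n.
    """
    # Solve k^2 - k - 2n = 0 for the positive root, then verify it exactly.
    k = (1 + math.isqrt(1 + 8 * n)) // 2
    return k if k * (k - 1) // 2 == n else None


print(points_for_condensed_length(15))  # 6
print(points_for_condensed_length(7))   # None
```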