Questions tagged [cluster-analysis]

Cluster analysis is the process of grouping "similar" objects into groups known as "clusters", along with the analysis of these results.

Cluster analysis is the task of grouping objects into subsets (called clusters) so that observations in the same cluster are similar in some sense, while observations in different clusters are dissimilar.

In and , clustering is a method of unsupervised learning used to discover hidden structure in unlabeled data, and is commonly used in exploratory data analysis. Popular algorithms include , expectation maximization (EM), spectral clustering, correlation clustering and .

Related topics: , , knowledge discovery, taxonomy. Not to be confused with cluster computing.

NOTE: If you want to use this tag for a question not directly concerning implementation, then consider posting on Cross Validated, Data Science, or Artificial Intelligence instead; otherwise you're probably off-topic. Please choose one site only and do not cross-post to more than one - see Is cross-posting a question on multiple Stack Exchange sites permitted if the question is on-topic for each site?

6244 questions
462
votes
8 answers

Cluster analysis in R: determine the optimal number of clusters

How can I choose the best number of clusters to do a k-means analysis. After plotting a subset of below data, how many clusters will be appropriate? How can I perform cluster dendro analysis? n = 1000 kk = 10 x1 = runif(kk) y1 = runif(kk) z1 =…
user2153893
  • 4,657
  • 3
  • 13
  • 5
228
votes
10 answers

Is it possible to specify your own distance function using scikit-learn K-Means Clustering?

Is it possible to specify your own distance function using scikit-learn K-Means Clustering?
bmasc
  • 2,410
  • 2
  • 15
  • 9
203
votes
20 answers

Difference between classification and clustering in data mining?

Can someone explain what the difference is between classification and clustering in data mining? If you can, please give examples of both to understand the main idea.
154
votes
20 answers

How do I determine k when using k-means clustering?

I've been studying about k-means clustering, and one thing that's not clear is how you choose the value of k. Is it just a matter of trial and error, or is there more to it?
Jason Baker
  • 192,085
  • 135
  • 376
  • 510
120
votes
8 answers

What is an intuitive explanation of the Expectation Maximization technique?

Expectation Maximization (EM) is a kind of probabilistic method to classify data. Please correct me if I am wrong if it is not a classifier. What is an intuitive explanation of this EM technique? What is expectation here and what is being…
109
votes
6 answers

1D Number Array Clustering

So let's say I have an array like this: [1,1,2,3,10,11,13,67,71] Is there a convenient way to partition the array into something like this? [[1,1,2,3],[10,11,13],[67,71]] I looked through similar questions yet most people suggested using k-means…
E.H.
  • 3,271
  • 4
  • 19
  • 18
99
votes
7 answers

Unsupervised clustering with unknown number of clusters

I have a large set of vectors in 3 dimensions. I need to cluster these based on Euclidean distance such that all the vectors in any particular cluster have a Euclidean distance between each other less than a threshold "T". I do not know how many…
59
votes
18 answers

K-means algorithm variation with equal cluster size

I'm looking for the fastest algorithm for grouping points on a map into equally sized groups, by distance. The k-means clustering algorithm looks straightforward and promising, but does not produce equally sized groups. Is there a variation of this…
pixelistik
  • 7,541
  • 3
  • 32
  • 42
54
votes
2 answers

plotting results of hierarchical clustering on top of a matrix of data

How can I plot a dendrogram right on top of a matrix of values, reordered appropriately to reflect the clustering, in Python? An example is the following figure: This is Figure 6 from: A panel of induced pluripotent stem cells from chimpanzees: a…
user248237
53
votes
3 answers

Scikit Learn - K-Means - Elbow - criterion

Today i'm trying to learn something about K-means. I Have understand the algorithm and i know how it works. Now i'm looking for the right k... I found the elbow criterion as a method to detect the right k but i do not understand how to use it with…
Linda
  • 2,375
  • 4
  • 30
  • 33
49
votes
8 answers

Python k-means algorithm

I am looking for Python implementation of k-means algorithm with examples to cluster and cache my database of coordinates.
Eeyore
  • 2,126
  • 7
  • 33
  • 49
48
votes
6 answers

How to get the samples in each cluster?

I am using the sklearn.cluster KMeans package. Once I finish the clustering if I need to know which values were grouped together how can I do it? Say I had 100 data points and KMeans gave me 5 cluster. Now I want to know which data points are in…
user77005
  • 1,769
  • 4
  • 18
  • 26
48
votes
9 answers

scikit-learn: Predicting new points with DBSCAN

I am using DBSCAN to cluster some data using Scikit-Learn (Python 2.7): from sklearn.cluster import DBSCAN dbscan = DBSCAN(random_state=0) dbscan.fit(X) However, I found that there was no built-in function (aside from "fit_predict") that could…
slaw
  • 6,591
  • 16
  • 56
  • 109
46
votes
5 answers

Plot dendrogram using sklearn.AgglomerativeClustering

I'm trying to build a dendrogram using the children_ attribute provided by AgglomerativeClustering, but so far I'm out of luck. I can't use scipy.cluster since agglomerative clustering provided in scipy lacks some options that are important to me…
Shukhrat Khannanov
  • 461
  • 1
  • 4
  • 4
44
votes
4 answers

kmeans: Quick-TRANSfer stage steps exceeded maximum

I am running k-means clustering in R on a dataset with 636,688 rows and 7 columns using the standard stats package: kmeans(dataset, centers = 100, nstart = 25, iter.max = 20). I get the following error: Quick-TRANSfer stage steps exceeded maximum…
Anna Dunietz
  • 825
  • 1
  • 9
  • 18
1
2 3
99 100