I have a dataframe that contains product IDs and sensors from different stations and production lines. Each value is either 1 (the product passes through the sensor) or 0 (there is no relation between the product and the sensor). Here is part of the dataframe:

[Image: sample of the dataframe]

I want to use a clustering method that groups the products into product families according to the process (the sensors they pass through).

Thank you for your help

  • Welcome to Stack Overflow. Please include a small sample of your dataframe along with your desired results. Take a look at [how-to-make-good-reproducible-pandas-examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples). – Shubham Sharma Jun 30 '20 at 11:17

1 Answer

Since you do not have labels, you need an unsupervised method, i.e. clustering.

One option is k-means (KMeans in scikit-learn). Here is an example on fake data:

import numpy as np
from sklearn.cluster import KMeans

np.random.seed(0)  # make the fake data reproducible

# build fake data with only 0/1 values in the features:
# start with all ones, then set randomly chosen rows to all zeros
X = np.ones((100, 10))
random_indices_rows = np.random.randint(1, 100, 50)
X[random_indices_rows] = 0

print(X.shape)
# (100, 10): 100 samples (products) and 10 variables (sensors)

# fit the clustering model; the number of clusters is chosen in advance
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)

# cluster label assigned to each sample
print(kmeans.labels_)

#array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1,
#       1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0,
#       0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
#       1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0,
#       1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32)
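
To apply the same idea to a dataframe like the one in the question, something along these lines might work. This is only a sketch: the column names (`product_id`, `sensor_1`, ...) and the values are hypothetical, since the real dataframe is not shown, and `n_clusters=2` is an arbitrary choice.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical data: column names and values are made up for illustration
df = pd.DataFrame({
    "product_id": ["P1", "P2", "P3", "P4", "P5", "P6"],
    "sensor_1": [1, 1, 1, 0, 0, 0],
    "sensor_2": [1, 1, 1, 0, 0, 0],
    "sensor_3": [0, 0, 0, 1, 1, 1],
    "sensor_4": [0, 0, 0, 1, 1, 1],
})

# Cluster on the sensor columns only, not on the ID
sensor_cols = [c for c in df.columns if c != "product_id"]
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(df[sensor_cols])

# Attach the cluster label to each product as its "family"
df["family"] = kmeans.labels_

# List the products that ended up in each family
print(df.groupby("family")["product_id"].apply(list))
```

Products that pass through the same sensors end up in the same family. The numeric label of each family (0 or 1) is arbitrary.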
  • Thank you for your answer. As I understand it, I need to know in advance exactly how many clusters I want. Is there a way to cluster without a predefined number of clusters? – Yosr Cheikh Jun 30 '20 at 13:27
  • With k-means you need to pre-define the number of clusters. The elbow method is a common heuristic for estimating a suitable number of clusters. – seralouk Jun 30 '20 at 14:04
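
The elbow method mentioned in the last comment could be sketched roughly as follows. The data here is synthetic and entirely made up (three hypothetical sensor patterns with a little noise), and the range of k values is an arbitrary choice. The idea is to fit k-means for several values of k and look at where the inertia (the within-cluster sum of squared distances, exposed by scikit-learn as `inertia_`) stops dropping sharply.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical data: 3 product families, each with its own 0/1 sensor pattern
patterns = np.array([
    [1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
])
families = rng.integers(0, 3, size=150)
X = patterns[families].astype(float)

# Flip a few entries at random so the families are not perfectly clean
noise = rng.random(X.shape) < 0.05
X[noise] = 1 - X[noise]

# Fit k-means for several values of k and record the inertia of each fit
ks = list(range(1, 9))
inertias = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(model.inertia_)

# The "elbow" is the k after which the inertia decreases only slowly
for k, inertia in zip(ks, inertias):
    print(k, round(inertia, 1))
```

Plotting `inertias` against `ks` makes the bend easier to see. On real data the elbow is often less clear-cut than on this synthetic example, so it is best treated as a rough guide.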