Clustering and labeling data set with 4 parameters

Question

This is a loaded question and it's my first 'real life' machine learning experiment, so bear with the simplistic questions.

I have USPTO bulk data that looks like this in a CSV file:

Name                     Class  Subclass  Category  Subcategory
Lightpack circuitboard   E        1         4       9
Lego blocks              F        2         56      12
D/C connector            E        3         4       1
Colorful dog hat         D        6         10      1
Grandma's shoes          D        2         11      1
Low temp resistor        O        2         4       10

What I want is to be able to run a supervised machine learning environment to group the common objects (there are many more than this in my actual data, but this is a simple example). I want to find a common set of class, subclass, category, and subcategory values among all electronics and group them into an electronics 'bin' (i.e., Lightpack circuitboard, D/C connector, and Low temp resistor), but I am unsure how to proceed.

Currently I'm using Python and sklearn to do my more simplistic modeling, but I am unsure how to train and test with the 4 parameters given, and I have no labeled set to compare to (no validation).

Would creating a pseudo-labeled set to make it supervised be more advised or is there an unsupervised approach I could take? As I said before this is my first real test in ML.

Shridhar R Kulkarni · Accepted Answer · 2018-02-16T05:00:14.763

3

Unsupervised algorithms are what you need here, since you have no labeled set to train or validate against.

The key concept you need to understand here is what multivariate distances are and how to calculate them. Then you can apply K-means clustering.
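To make the idea of a multivariate distance concrete, here is a minimal sketch using only the sample rows from the question. It assumes the letter-valued Class column is one-hot encoded and that the three numeric codes can be treated as quantities, which may not hold for real classification codes. The helper names `to_vector` and `euclidean` are made up for illustration.

```python
import math

# Sample rows from the question: (name, class, subclass, category, subcategory)
rows = [
    ("Lightpack circuitboard", "E", 1, 4, 9),
    ("Lego blocks", "F", 2, 56, 12),
    ("D/C connector", "E", 3, 4, 1),
    ("Colorful dog hat", "D", 6, 10, 1),
    ("Grandma's shoes", "D", 2, 11, 1),
    ("Low temp resistor", "O", 2, 4, 10),
]

# All distinct class letters, in a fixed order, for one-hot encoding.
classes = sorted({row[1] for row in rows})


def to_vector(row):
    """Turn one row into a numeric vector (assumption: codes behave like numbers)."""
    _, cls, subclass, category, subcategory = row
    one_hot = [1.0 if cls == c else 0.0 for c in classes]
    return one_hot + [float(subclass), float(category), float(subcategory)]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


vectors = {row[0]: to_vector(row) for row in rows}

d_near = euclidean(vectors["Lightpack circuitboard"], vectors["Low temp resistor"])
d_far = euclidean(vectors["Lightpack circuitboard"], vectors["Lego blocks"])
print(d_near, d_far)
```

On this sample the circuit board ends up much closer to the resistor than to the Lego blocks, and almost all of the larger distance comes from the Category column, which is why unscaled variables can dominate a distance.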

You can also read about PCA and use it. Scale the variables first, because PCA is sensitive to differences in scale between them.
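A rough sketch of how scaling, PCA, and K-means could fit together in sklearn, using the sample rows from the question. The choice of 3 clusters and 2 principal components is arbitrary here, and treating Subclass, Category, and Subcategory as numeric quantities is an assumption you would need to check against your real data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import OneHotEncoder, StandardScaler

names = [
    "Lightpack circuitboard", "Lego blocks", "D/C connector",
    "Colorful dog hat", "Grandma's shoes", "Low temp resistor",
]
# Class is a letter, so it is kept separate and one-hot encoded.
classes = np.array([["E"], ["F"], ["E"], ["D"], ["D"], ["O"]])
# Subclass, Category, Subcategory (assumed numeric for this sketch).
numeric = np.array(
    [[1, 4, 9], [2, 56, 12], [3, 4, 1], [6, 10, 1], [2, 11, 1], [2, 4, 10]],
    dtype=float,
)

# Encode the categorical column and scale the numeric ones.
class_onehot = OneHotEncoder().fit_transform(classes).toarray()
numeric_scaled = StandardScaler().fit_transform(numeric)
X = np.hstack([class_onehot, numeric_scaled])

# Optional: project onto 2 principal components before clustering.
X_2d = PCA(n_components=2).fit_transform(X)

# n_clusters=3 is a guess; n_init and random_state make runs repeatable.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_2d)

for name, label in zip(names, labels):
    print(label, name)
```

The cluster numbers themselves carry no meaning. You would inspect which items land together and then name a cluster (for example 'electronics') by hand.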

edited Feb 16 '18 at 05:00

answered Feb 16 '18 at 04:05


In the meantime as I continue to learn is there any basic algorithm you could give me a hint with that could help me begin the process? I find conceptually I understand the ideas - it's a matter of turning those concepts into actual code that I'm having the hurdles with. – HelloToEarth Feb 16 '18 at 05:09
K-means is itself a basic algorithm when it comes to unsupervised learning. You can find implementations of it on the internet. Just a suggestion: learn k-means with a single variable first, then move on to the multivariate case. I believe this answers your doubts; if not, let me know. – Shridhar R Kulkarni Feb 16 '18 at 05:13

score 1 · Answer 2 · answered Feb 16 '18 at 08:42

As rightly pointed out, you can use any clustering algorithm (K-means or one of its variants, hierarchical clustering, or the EM algorithm). Each follows a simple procedure for assigning data points to a certain number of clusters. Since the number of clusters is not known, for K-means you can try different values of K and use the elbow method to choose the most suitable one; alternatively, hierarchical clustering will let you find a suitable k.
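The elbow method mentioned above could be sketched roughly like this with sklearn. It uses only the three numeric columns from the question's sample (again assuming they can be treated as quantities) and an arbitrary range of k from 1 to 5. The within-cluster sum of squares that sklearn exposes as `inertia_` is what you would plot against k.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Subclass, Category, Subcategory from the sample rows in the question.
numeric = np.array(
    [[1, 4, 9], [2, 56, 12], [3, 4, 1], [6, 10, 1], [2, 11, 1], [2, 4, 10]],
    dtype=float,
)
X = StandardScaler().fit_transform(numeric)

# Fit K-means for each candidate k and record the within-cluster sum of squares.
inertias = {}
for k in range(1, 6):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_

for k, inertia in inertias.items():
    print(f"k={k}: inertia={inertia:.3f}")
```

You would then look for the k after which the inertia stops dropping sharply. With only six sample rows the curve is not very informative, so treat this as a template for the full data set.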
