Classification algorithm for data that is only mostly consistent

Question

I have a data set with approximately 30 features. All but one are similar numeric features; the remaining one is a category (a cluster label produced by a preprocessing step).

Within each cluster, rows generally have similar values for the same set of features, but there are often some outliers (see below).

For example, with features labeled A, B, C, etc.:

Note: I have converted the NAN in the data to the number 0.

A   B   C   D   E   F   G   H   …>  Cluster 
78  0   0   67  48  35  0   0       1   
0   67  0   66  45  35  0   0       1   
0   0   0   68  44  38  0   0       1   
0   0   0   66  43  36  0   0       1   
78  50  67  0   0   0   0   0       2   
75  55  60  0   0   0   0   0       2   
77  54  61  0   0   78  0   0       2

Question: I need to be able to feed in a new feature set (a single row) and predict its cluster number. What would be the best classification algorithm for this task, given that the data contains these outliers and the rows within a cluster are only mostly similar?

Seems off topic for this site, but look into k-means clustering. A simple thing would be to compute the Euclidean distance between the new row and each of the clusters (maybe the centroid of the points) and classify it to the closest cluster. — pault, Jan 28 '18 at 21:55
Thx @pault. Questions: 1. There seem to be many different ways and tools to compute Euclidean distance; which one do you think works best with higher-dimensional data at a scale of ~100,000 rows of points? 2. How do I create a centroid for each cluster? — Mat, Jan 29 '18 at 07:10
Both of these questions can be answered via google search. For 1, try [this post](https://stackoverflow.com/questions/1401712/how-can-the-euclidean-distance-be-calculated-with-numpy). For 2, start with the simple average across all dimensions. You may also want to look into (google) clustering algorithms and recommender systems. — pault, Jan 29 '18 at 20:00
Thx @pault, appreciate the help and sorry if I used the wrong forum for a more general question. — Mat, Jan 29 '18 at 23:44

score 0 · Answer 1 · answered Jan 29 '18 at 23:48

Thx @pault for the pointer to: "compute the Euclidean distance between the new row and each of the clusters (maybe the centroid of the points) and classify it to the closest cluster."
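
The approach quoted above can be sketched as follows. This is a minimal illustration, not a tuned solution: the helper names `fit_centroids` and `predict_cluster` are hypothetical, the centroid is assumed to be the simple per-feature mean suggested in the comments, and `new_row` is a made-up row. Only the sample rows come from the question (features A to H, with NaN already replaced by 0).

```python
import math
from collections import defaultdict


def fit_centroids(rows, labels):
    """Return {cluster_label: centroid}; each centroid is the per-feature mean."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    return {
        label: [sum(col) / len(members) for col in zip(*members)]
        for label, members in grouped.items()
    }


def predict_cluster(centroids, new_row):
    """Return the label whose centroid has the smallest Euclidean distance to new_row."""
    return min(centroids, key=lambda label: math.dist(centroids[label], new_row))


# Sample rows from the question (features A..H only).
rows = [
    [78, 0, 0, 67, 48, 35, 0, 0],
    [0, 67, 0, 66, 45, 35, 0, 0],
    [0, 0, 0, 68, 44, 38, 0, 0],
    [0, 0, 0, 66, 43, 36, 0, 0],
    [78, 50, 67, 0, 0, 0, 0, 0],
    [75, 55, 60, 0, 0, 0, 0, 0],
    [77, 54, 61, 0, 0, 78, 0, 0],
]
labels = [1, 1, 1, 1, 2, 2, 2]

centroids = fit_centroids(rows, labels)

# Hypothetical new row, invented for illustration only.
new_row = [0, 0, 0, 65, 46, 37, 0, 0]
predicted = predict_cluster(centroids, new_row)
print(predicted)
```

`math.dist` requires Python 3.8 or later. At around 100,000 rows the same logic could be vectorized with NumPy, and since the mean is pulled toward outliers, a per-feature median may be worth trying as the centroid instead; both are suggestions rather than something tested on the real data.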

Mat

