How to format data for the Spark MLlib k-means clustering algorithm?

Question

I'm trying to run the k-means clustering algorithm from Apache Spark's MLlib library. I have everything set up, but I'm not sure how to format the input data. I'm relatively new to machine learning, so any help would be appreciated. In the sample data.txt the data is as follows:

0.0 0.0 0.0
0.1 0.1 0.1
0.2 0.2 0.2
9.0 9.0 9.0
9.1 9.1 9.1
9.2 9.2 9.2

And the data that I want to run the algorithm on is currently in this format (a JSON array):

[{"customer":"ddf6022","order_id":"20031-19958","asset_id":"dd1~33","price":300,"time":1411134115000,"location":"bt2"},{"customer":"ddf6023","order_id":"23899-23825","asset_id":"dd1~33","price":300,"time":1411954672000,"location":"bt2"}]

How can I convert it into something that can be used with the k-means clustering algorithm? I'm using Java, and I'm guessing I need it in a JavaRDD, but I have no idea how to go about it.

score 3 · Answer 1 · edited May 23 '17 at 12:15

How this works:

First of all, you have to decide which dimensions you would like to apply KMeans to. The KMeans example included in the Spark documentation is applied to a data set of 3D points (X, Y and Z dimensions). Take into account that the KMeans implementation in MLlib can work on sets of n dimensions, where n >= 1.
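To make the "n dimensions" point concrete, here is a minimal sketch in plain Java (no Spark dependency) of turning one whitespace-separated line into an array of coordinates. The class and method names are hypothetical. The resulting double[] is the kind of array you would then pass to MLlib's Vectors.dense(double[]) inside your own map function.

```java
// Hypothetical helper, not part of Spark: parses one line of the input file.
class PointParser {
    // Splits a whitespace-separated line such as "0.1 0.1 0.1" into its coordinates.
    // The number of tokens on the line determines the number of dimensions.
    static double[] parse(String line) {
        String[] tokens = line.trim().split("\\s+");
        double[] values = new double[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            values[i] = Double.parseDouble(tokens[i]);
        }
        return values;
    }
}
```

Every line should yield the same number of values, since all points have to live in the same n-dimensional space.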

A Proposal:

So let's say that, for your input, the X, Y and Z dimensions are going to be the JSON fields price, time and location. Then all you have to do is extract those dimensions from your data set and put them in a text file as follows:

300 1411134115000 2
300 1411954672000 2
...
...
...

Here location "bt2" has been replaced by 2 (assuming that your data set has other locations). You have to provide numeric values to KMeans.
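A minimal sketch of that extraction step in plain Java could look like the following. It assumes the three fields have already been read out of the JSON with a parser of your choice, that price is a whole number as in the sample, and that every location looks like "bt2" (letters followed by a number) so the numeric part can serve as the code. The class and method names are hypothetical.

```java
// Hypothetical helper: turns one order record into a space-separated numeric line.
class FeatureLine {
    // Assumption: a location is some letters followed by digits, e.g. "bt2" -> 2.
    // Throws NumberFormatException if the location contains no digits.
    static int encodeLocation(String location) {
        return Integer.parseInt(location.replaceAll("\\D", ""));
    }

    // Builds one line of the text file in the order: price, time, location code.
    static String toLine(long price, long time, String location) {
        return price + " " + time + " " + encodeLocation(location);
    }
}
```

If your locations do not follow that pattern, a lookup table from location name to integer would do the same job.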

Notes/Ideas:

For better clustering results, and depending on how the data is distributed over time, you could take advantage of the timestamp field by transforming it into separate values: year, month, day, hour, minute, second, etc. You could then experiment with different dimensions as separate fields, depending on your clustering purpose.
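As a rough illustration of the timestamp idea, the sketch below splits a time value into calendar fields using java.time. It assumes the time field is milliseconds since the Unix epoch (which the 13-digit sample values suggest) and interprets it in UTC. The class and method names are hypothetical.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;

// Hypothetical helper: expands a timestamp into separate numeric dimensions.
class TimeFeatures {
    // Returns {year, month, day, hour, minute, second} for an epoch-millisecond
    // timestamp, interpreted in UTC (an assumption; use your own time zone if needed).
    static double[] fromEpochMillis(long epochMillis) {
        ZonedDateTime t = Instant.ofEpochMilli(epochMillis).atZone(ZoneOffset.UTC);
        return new double[] {
            t.getYear(), t.getMonthValue(), t.getDayOfMonth(),
            t.getHour(), t.getMinute(), t.getSecond()
        };
    }
}
```

For the first sample record, 1411134115000 becomes 2014, 9, 19, 13, 41, 55. You can keep only the fields that matter for your clustering, for example just the hour.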

Also, I guess you would like to automate the JSON-to-CSV conversion. In your mapping implementation you could use an approach like this: https://stackoverflow.com/a/15411074/833336
