
I am trying to cluster a large dataset using MLlib's GMM implementation. The problem is that my dataset has categorical inputs, which are converted to floats inside `GaussianMixture.train`, so I am afraid the algorithm is treating the categorical data as continuous rather than categorical. When I tried passing alphanumeric strings as training data to `GaussianMixture.train`, it threw a type error saying it could not convert the given string to float. Is there a way to cluster categorical data with MLlib's GMM implementation, or are there other clustering algorithms in MLlib that support categorical variables?

    from pyspark.mllib.clustering import GaussianMixture

    rdd = sc.textFile('s3n://msd.data.test/sud/new_cls122016-04-26')
    # Keep the first nine comma-separated fields of each line
    rdd1 = rdd.map(lambda x: x.split(',')[:9])
    gmm = GaussianMixture.train(rdd1, 35, seed=10)
    label = gmm.predict(rdd1)

`rdd1` is the training data; columns 0 to 6 are integers, and columns 7 and 8 are categorical variables.


user2233120
  • I don't know about MLlib, but I have an unrelated tip. You can simplify your `map` statement to `import csv; rdd.map(lambda x: next(csv.reader([x]))[:9])`; [see here](http://stackoverflow.com/a/36408724/6157047) for more explanation. – Galen Long Apr 26 '16 at 21:09
  • No, I am not looking to read it as a CSV; I am trying to see if there is a way of handling categorical data while using GMM. – user2233120 Apr 27 '16 at 08:46

1 Answer


Gaussian distributions are only defined on continuous variables, because the normal (Gaussian) distribution is continuous.

So encoding your categorical attributes as continuous variables is probably the best you can do, besides ignoring them.
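One common way to do such an encoding is one-hot encoding, where each category becomes its own 0/1 column. The answer does not name a specific scheme, so treat this as a minimal sketch under that assumption; the helper names `build_vocab` and `encode_row` and the sample rows are hypothetical, and plain Python lists stand in for the RDD.

```python
def build_vocab(rows, cat_cols):
    """Map each categorical column index to {category: position}.

    Categories are sorted so the column order is deterministic.
    """
    vocab = {}
    for c in cat_cols:
        values = sorted({row[c] for row in rows})
        vocab[c] = {v: i for i, v in enumerate(values)}
    return vocab


def encode_row(row, vocab):
    """Return a list of floats: numeric fields cast, categorical fields one-hot."""
    out = []
    for c, value in enumerate(row):
        if c in vocab:
            one_hot = [0.0] * len(vocab[c])
            one_hot[vocab[c][value]] = 1.0
            out.extend(one_hot)
        else:
            out.append(float(value))
    return out


# Hypothetical sample: two numeric columns followed by two categorical ones.
rows = [
    ["1", "2", "red", "A"],
    ["3", "4", "blue", "B"],
    ["5", "6", "red", "B"],
]
vocab = build_vocab(rows, cat_cols=[2, 3])
encoded = [encode_row(r, vocab) for r in rows]
```

With the question's data, the same idea would use `cat_cols=[7, 8]`, collect the distinct categories to the driver to build the vocabulary, and apply `rdd1.map(lambda r: encode_row(r, vocab))` before calling `GaussianMixture.train`. The mixture still treats the resulting 0/1 columns as continuous values, which is the limitation described above.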

Has QUIT--Anony-Mousse