How to use PySpark MLlib's GMM while dealing with categorical variables?

Question

I am trying to cluster a large dataset using MLlib's GMM implementation. The problem is that my dataset has categorical inputs, which are converted to floats inside GMM's train function, so I am afraid the algorithm is treating the categorical data as continuous rather than categorical. When I tried passing alphanumeric strings as training data to the train function, it threw a type error saying it could not convert the given string to float. Is there a way to cluster categorical data with MLlib's GMM implementation, or are there other clustering algorithms in MLlib that support categorical variables?

    from pyspark.mllib.clustering import GaussianMixture

    rdd = sc.textFile('s3n://msd.data.test/sud/new_cls122016-04-26')
    # keep the first nine comma-separated fields of each line (still strings)
    rdd1 = rdd.map(lambda x: x.split(',')[:9])
    gmm = GaussianMixture.train(rdd1, 35, seed=10)
    label = gmm.predict(rdd1)

`rdd1` is the training data, with columns 0 to 6 being integers and columns 7 and 8 being categorical variables.


I don't know about MLlib, but I have an unrelated tip. You can simplify your `map` statement to `import csv; rdd.map(lambda x: csv.reader(x)[:8])`; [see here](http://stackoverflow.com/a/36408724/6157047) for more explanation. — Galen Long, Apr 26 '16 at 21:09
No, I am not looking to read it as a CSV; I am trying to see if there is a way of handling categorical data while using GMM. — user2233120, Apr 27 '16 at 08:46

score 1 · Answer 1 · answered Apr 27 '16 at 22:29


Gaussian distributions are only defined on continuous variables, because the normal (Gaussian) distribution is itself continuous.

So encoding your categorical attributes as continuous variables is probably the best you can do, besides ignoring them.
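One common way to do such an encoding is one-hot (dummy) encoding, where each category value becomes its own 0/1 column. The following is only a minimal sketch in plain Python, not something taken from the question: the helper names `build_category_index` and `encode_row` and the sample rows are hypothetical, and the rows are assumed to be lists of strings like those in `rdd1`.

```python
def build_category_index(rows, categorical_cols):
    """Map each categorical column to a {value: position} dictionary."""
    index = {}
    for col in categorical_cols:
        # sort the distinct values so the column order is deterministic
        values = sorted({row[col] for row in rows})
        index[col] = {value: pos for pos, value in enumerate(values)}
    return index


def encode_row(row, index):
    """Cast numeric fields to float and one-hot encode categorical fields."""
    features = []
    for col, value in enumerate(row):
        if col in index:
            one_hot = [0.0] * len(index[col])
            one_hot[index[col][value]] = 1.0
            features.extend(one_hot)
        else:
            features.append(float(value))
    return features


# Hypothetical sample: two numeric columns followed by two categorical ones.
sample = [
    ["1", "2", "red", "S"],
    ["3", "4", "blue", "M"],
    ["5", "6", "red", "M"],
]
index = build_category_index(sample, categorical_cols=[2, 3])
encoded = [encode_row(row, index) for row in sample]
```

With Spark, the distinct values of columns 7 and 8 could be gathered with something like `rdd1.map(lambda r: r[7]).distinct().collect()`, and the encoded RDD produced by `rdd1.map(lambda r: encode_row(r, index))` could then be passed to `GaussianMixture.train`. Note that the model still treats the 0/1 columns as continuous, and the indicator columns of one variable always sum to 1, which can make the covariance estimate ill-conditioned; dropping one indicator per variable is a common workaround.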


Has QUIT--Anony-Mousse
