I am trying to cluster a large dataset using MLlib's GMM implementation. My dataset has categorical inputs, which are converted to floats inside GMM's train function, so I am afraid the algorithm is treating the categorical data as continuous rather than categorical. When I tried passing alphanumeric strings as training data to GMM's train function, it threw a type error saying it could not convert the given string to float. Is there a way to cluster categorical data with MLlib's GMM implementation? Alternatively, are there other clustering algorithms in MLlib that support categorical variables?
from pyspark.mllib.clustering import GaussianMixture

rdd = sc.textFile('s3n://msd.data.test/sud/new_cls122016-04-26')
# Keep the first nine comma-separated fields of each line
rdd1 = rdd.map(lambda x: x.split(',')[:9])
gmm = GaussianMixture.train(rdd1, 35, seed=10)
label = gmm.predict(rdd1)
rdd1 is the training data: columns 0 to 6 are integers, and columns 7 and 8 are categorical variables.
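
For illustration, one common workaround for the float-conversion problem described above is to one-hot encode the categorical columns before training. The sketch below is plain Python and only assumes the column layout stated in the question (columns 0 to 6 numeric, 7 and 8 categorical). `build_vocab` and `one_hot_row` are hypothetical helper names, not part of MLlib, and the sample rows are made up. Note that this only avoids imposing an artificial ordering on the categories; GMM would still model the resulting 0/1 columns as continuous Gaussians.

```python
def build_vocab(rows, cat_cols):
    """Map each distinct value of every categorical column to an index."""
    vocab = {}
    for col in cat_cols:
        values = sorted({row[col] for row in rows})
        vocab[col] = {v: i for i, v in enumerate(values)}
    return vocab


def one_hot_row(row, vocab, num_cols, cat_cols):
    """Return a flat list of floats: numeric columns, then one-hot blocks."""
    features = [float(row[c]) for c in num_cols]
    for col in cat_cols:
        block = [0.0] * len(vocab[col])
        block[vocab[col][row[col]]] = 1.0
        features.extend(block)
    return features


# Made-up rows in the same layout as rdd1 (hypothetical values).
sample = [
    "1,2,3,4,5,6,7,red,A1".split(","),
    "2,3,4,5,6,7,8,blue,B2".split(","),
    "3,4,5,6,7,8,9,red,B2".split(","),
]
NUM_COLS = list(range(7))  # columns 0 to 6 are numeric
CAT_COLS = [7, 8]          # columns 7 and 8 are categorical

vocab = build_vocab(sample, CAT_COLS)
encoded = [one_hot_row(r, vocab, NUM_COLS, CAT_COLS) for r in sample]
```

With Spark, the same idea could presumably be applied by collecting the distinct values of each categorical column (for example via `rdd1.map(lambda r: r[7]).distinct().collect()`), building the vocabulary on the driver, and mapping `one_hot_row` over `rdd1` before calling `GaussianMixture.train`. This is only practical when the categorical columns have a modest number of distinct values.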