
I have a number of questions, each with a set of choices, that users are going to answer. They have this format:

question_id, text, choices

For each user I store the answered questions and the selected choices as a JSON document in MongoDB:

{"user_id": "", "question_answers": [{"question_id": "choice_id", ..}]}

Now I'm trying to use K-Means clustering and streaming to find the most similar users based on their choices, but I need to convert my user data into numeric vectors like the example in Spark's docs here.

K-Means sample data and my desired output:

0.0 0.0 0.0
0.1 0.1 0.1
0.2 0.2 0.2
9.0 9.0 9.0
9.1 9.1 9.1
9.2 9.2 9.2

I've already tried scikit-learn's DictVectorizer, but it doesn't seem to work well for this.

I created a key for each question_choice combination like this:

from sklearn.feature_extraction import DictVectorizer
v = DictVectorizer(sparse=False)
D = [{'question_1_choice_1': 1, 'question_1_choice_2': 1}, ..]
X = v.fit_transform(D)

Then I transform each user's question/choice pairs like this:

v.transform({'question_1_choice_2': 1, ...})

And I get a result like this:

[[ 0.  1.  0.  0.  0.  0.  0.  0.  0.  0.]]
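
Here's a minimal, self-contained version of what I'm doing (the question/choice names are just placeholders for my real data):

from sklearn.feature_extraction import DictVectorizer

# Vocabulary of every known question/choice combination.
v = DictVectorizer(sparse=False)
D = [
    {'question_1_choice_1': 1, 'question_1_choice_2': 1},
    {'question_2_choice_1': 1, 'question_2_choice_2': 1, 'question_2_choice_3': 1},
]
v.fit(D)

# One user's answers encoded as a binary vector over that vocabulary.
user_answers = {'question_1_choice_2': 1, 'question_2_choice_3': 1}
print(v.transform(user_answers))  # -> [[0. 1. 0. 0. 1.]], one column per question/choice key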

Is this the right approach? It means I need to build a dict of all my questions and choices up front every time. Is there a way to do this in Spark?

Thanks in advance. Sorry, I'm new to data science.

  • What is your reading format? How do you read your data? What is the type? – eliasah Aug 23 '17 at 09:27
  • @eliasah I'll be reading it from MongoDB, which is JSON. That means I'll need to load the questions and choices to build the vectorizer first, then go through the users to transform their data with it, which I don't think is very efficient. – Amin Alaee Aug 23 '17 at 09:30
  • 3
    It's a bit hard to relate your JSON data to the K-Means sample data you show. Using K-Means, you need to make sure you are actually dealing with interval or ratio data. If your data is nominal or ordinal, you can't use K-Means. You can, however, use K-Modes, which operates on dissimilarity of nominal or ordinal data. Relevant papers: "Clustering Categorial Data with k-Modes" by Joshua Zhexue Huang and "An empirical comparison of four initialization methods for the K-Means algorithm" by J. M Peña et al. – henrikstroem Aug 23 '17 at 09:34

1 Answer


Don't use K-Means with categorical data. Let me quote How to understand the drawbacks of K-means by KevinKim:

  • k-means assume the variance of the distribution of each attribute (variable) is spherical;

  • all variables have the same variance;

  • the prior probability for all k clusters is the same, i.e. each cluster has a roughly equal number of observations.

If any one of these 3 assumptions is violated, then k-means will fail.

With one-hot encoded categorical data, the first two assumptions are almost certain to be violated.
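
For instance, a one-hot encoded 0/1 column has variance p(1 - p), where p is the fraction of users picking that choice, so rare and popular choices end up with very different variances. A quick sketch with made-up frequencies:

import numpy as np

# Made-up frequencies: a choice picked by 90% of users vs. one picked by 5%.
popular = np.random.binomial(1, 0.90, size=10000)
rare = np.random.binomial(1, 0.05, size=10000)

# The variance of a 0/1 column is p * (1 - p), so the two are nowhere near equal,
# which already breaks the "all variables have the same variance" assumption.
print(popular.var(), rare.var())  # roughly 0.09 vs 0.0475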

For further discussion see K-means clustering is not a free lunch by David Robinson.

I'm trying to use K-Means clustering and streaming to find the most similar users based on their choices

For similarity searches, use MinHashLSH with an approximate similarity join.

You'll have to StringIndex and OneHotEncode all variables for that.
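
A rough sketch of that pipeline in PySpark (the column names, the toy data and the 0.6 distance threshold are made up, and the OneHotEncoder API differs slightly between Spark versions):

from pyspark.sql import SparkSession, functions as F
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler, MinHashLSH

spark = SparkSession.builder.getOrCreate()

# Hypothetical input: one row per user, one column per question holding the chosen choice_id.
users = spark.createDataFrame(
    [("u1", "c1", "c2"), ("u2", "c1", "c3"), ("u3", "c2", "c2")],
    ["user_id", "q1", "q2"])

questions = ["q1", "q2"]
stages = []
for q in questions:
    # Index each choice string, then one-hot encode it (keep every category as its own bit).
    stages.append(StringIndexer(inputCol=q, outputCol=q + "_idx"))
    stages.append(OneHotEncoder(inputCol=q + "_idx", outputCol=q + "_vec", dropLast=False))

# Concatenate the per-question vectors into a single binary feature vector per user.
stages.append(VectorAssembler(inputCols=[q + "_vec" for q in questions], outputCol="features"))
encoded = Pipeline(stages=stages).fit(users).transform(users)

# MinHash over the binary vectors, then an approximate self-join on Jaccard distance.
mh = MinHashLSH(inputCol="features", outputCol="hashes", numHashTables=5)
model = mh.fit(encoded)
similar = (model.approxSimilarityJoin(encoded, encoded, 0.6, distCol="jaccard_dist")
           .filter(F.col("datasetA.user_id") < F.col("datasetB.user_id")))
similar.select(F.col("datasetA.user_id").alias("user_a"),
               F.col("datasetB.user_id").alias("user_b"),
               "jaccard_dist").show()

The join returns every pair of users whose Jaccard distance is below the threshold, so you can read off each user's nearest neighbours from it instead of forcing the data into K-Means clusters.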

See also the comment by henrikstroem.
