I have a number of questions, each with a set of choices, that users will answer. They have this format:
question_id, text, choices
For each user I store the answered questions and the selected choices as a JSON document in MongoDB:
{"user_id": "", "question_answers": [{"question_id": "choice_id", ...}]}
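For reference, a user document in this shape can be flattened into one binary indicator per question/choice pair. A minimal pure-Python sketch, with field names taken from the sample document above (the helper name and example IDs are made up):

```python
def flatten_answers(user_doc):
    """Turn {"question_answers": [{"q1": "c2"}, ...]} into {"q1_c2": 1, ...}."""
    flat = {}
    for answer in user_doc["question_answers"]:
        for question_id, choice_id in answer.items():
            # one key per question/choice pair the user actually selected
            flat[f"{question_id}_{choice_id}"] = 1
    return flat

doc = {"user_id": "u1",
       "question_answers": [{"question_1": "choice_2"}, {"question_2": "choice_1"}]}
print(flatten_answers(doc))  # {'question_1_choice_2': 1, 'question_2_choice_1': 1}
```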
Now I'm trying to use streaming K-Means clustering to find the most similar users based on their choices, but first I need to convert my user data into numeric vectors like the example in Spark's docs here.
The kmeans_data sample, which is also my desired output format:
0.0 0.0 0.0
0.1 0.1 0.1
0.2 0.2 0.2
9.0 9.0 9.0
9.1 9.1 9.1
9.2 9.2 9.2
I've already tried scikit-learn's DictVectorizer, but it doesn't seem to work well here.
I created a key for each question_choice combination like this:
from sklearn.feature_extraction import DictVectorizer
v = DictVectorizer(sparse=False)
# one dict per user; each key is a question_choice pair that user selected
D = [{'question_1_choice_1': 1, 'question_1_choice_2': 1}, ...]
X = v.fit_transform(D)
Then I transform each user's question/choice pairs like this:
v.transform({'question_1_choice_2': 1, ...})
And I get a result like this:
[[ 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]]
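To make that output less mysterious, here is a pure-Python sketch of what DictVectorizer is doing under the hood: fitting learns a sorted vocabulary of all question_choice keys, and transforming emits one 0/1 slot per vocabulary key. (The helper names and sample data below are made up.)

```python
def fit_vocabulary(dicts):
    # DictVectorizer stores its learned keys in sorted order
    return sorted({key for d in dicts for key in d})

def to_vector(d, vocabulary):
    # one 0/1 slot per known key; unseen keys are simply dropped
    return [float(d.get(key, 0)) for key in vocabulary]

D = [{'question_1_choice_1': 1, 'question_1_choice_2': 1},
     {'question_2_choice_1': 1}]
vocab = fit_vocabulary(D)
print(vocab)                                         # all known question_choice keys
print(to_vector({'question_1_choice_2': 1}, vocab))  # [0.0, 1.0, 0.0]
```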
Is this the right approach? It forces me to build a dict of every possible question/choice combination up front each time. Is there a way to do this in Spark itself?
Thanks in advance. Sorry I'm new to data science.