I have a data set of questions and answers, where each user answers a question by picking one of several choices. I'm trying to build a user-user recommendation engine that finds similar users based on their answers to the questions. An important point: the questions are shuffled, so they do not arrive in a fixed order, and the data is streaming.
So for each user I have data like this:
user_1: {"question_1": "choice_1", ...}
user_2: {"question_3": "choice_4", ...}
user_3: {"question_1": "choice_3", ...}
Most tutorials I have found are about user-item recommendations, and I could not find anything about user-user recommendations.
From what I've read, clustering and cosine similarity might be good options, and I've found that columnSimilarities on a Spark MLlib RowMatrix is very efficient:
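from pyspark.mllib.linalg.distributed import RowMatrix
# sc is an existing SparkContext
rows = sc.parallelize([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
mat = RowMatrix(rows)
# Cosine similarities between the columns of mat
sims = mat.columnSimilarities()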
I have two questions:
Is it wise to define each user as a column and the question/choice pairs as rows to get the result I need?
And how should I vectorize this kind of data into numbers if I need to do clustering?
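To make both questions concrete, here is a minimal plain-Python sketch of what I have in mind. It assumes a one-hot encoding in which every observed (question, choice) pair becomes one feature; that encoding is my guess, not something I know to be the right approach. The sample answers and the helper names (build_vocabulary, vectorize, cosine) are made up for illustration. Because the features are keyed by question name, the shuffled order should not matter.

```python
import math

# Hypothetical sample data in the same shape as my real data.
answers = {
    "user_1": {"question_1": "choice_1", "question_2": "choice_2"},
    "user_2": {"question_3": "choice_4", "question_1": "choice_1"},
    "user_3": {"question_1": "choice_3", "question_2": "choice_2"},
}

def build_vocabulary(answers):
    """Map every observed (question, choice) pair to a fixed feature index."""
    pairs = sorted({(q, c) for user in answers.values() for q, c in user.items()})
    return {pair: i for i, pair in enumerate(pairs)}

def vectorize(user_answers, vocab):
    """Sparse one-hot vector, stored as the set of active feature indices."""
    return {vocab[(q, c)] for q, c in user_answers.items() if (q, c) in vocab}

def cosine(a, b):
    """Cosine similarity of two binary vectors given as index sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

vocab = build_vocabulary(answers)
vecs = {user: vectorize(ans, vocab) for user, ans in answers.items()}

# Layout from my first question: one row per feature, one column per user.
users = sorted(answers)
feature_rows = [
    [1.0 if idx in vecs[u] else 0.0 for u in users]
    for idx in range(len(vocab))
]

print(cosine(vecs["user_1"], vecs["user_2"]))
print(feature_rows)
```

In this layout, feature_rows is what I would pass to sc.parallelize and RowMatrix so that the column similarities are user-user similarities. What I am unsure about is how to keep the vocabulary stable when new questions keep arriving in the stream.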
Thanks in advance :)