
I have a data set of questions and the answers that users selected from multiple choices. I'm trying to build a user-user recommendation engine to find similar users based on their answers to the questions. An important point: the questions are shuffled, so they arrive in no particular order, and the data is streaming.

So for each user I have data like this:

user_1: {"question_1": "choice_1", ...}
user_2: {"question_3": "choice_4", ...}
user_3: {"question_1": "choice_3", ...}

I have found that most tutorials are about user-item recommendations, but nothing about user-user recommendations.

I've realized that clustering and cosine similarity might be good options, and I've found that columnSimilarities is very efficient.

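from pyspark.mllib.linalg.distributed import RowMatrix

rows = sc.parallelize([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
mat = RowMatrix(rows)
# Cosine similarity between every pair of columns
sims = mat.columnSimilarities()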

I have two questions:

Is it wise to define each user as a column and the questions/choices as rows to get the result I need?

And how should I vectorize this kind of data into numbers, in case I need to do clustering?

Thanks in advance :)

Amin Alaee
  • columnSimilarities is meant for tall and skinny matrices, so if you have a user-user matrix on which you wish to perform that task, it won't work (e.g. if you have 1M users). – eliasah Aug 23 '17 at 07:33
  • @eliasah Yes thank you for your reply. Just wanted to make sure. So would clustering be a better approach? – Amin Alaee Aug 23 '17 at 07:34

1 Answer

Unfortunately, that's not the way it can be done. It's too good to be true, isn't it?

columnSimilarities is meant for tall and skinny matrices (many rows, few columns), so if you have a user-user matrix on which you wish to perform that task, it won't work (e.g. if you have 1M users).

From your description, it looks like you might have a short and wide matrix, so columnSimilarities won't work for you.
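To illustrate why the matrix shape matters, here is a minimal sketch in plain Python (not Spark) of what a column-wise cosine similarity computes. The helper name `column_cosine_similarities` is hypothetical and only mirrors the idea, not Spark's implementation. The result has one entry per pair of columns, so with users as columns, n users produce n(n-1)/2 pairs; for 1M users that is roughly 5 × 10^11 entries.

```python
import math
from itertools import combinations


def column_cosine_similarities(rows):
    """Return {(i, j): cosine similarity} for every pair of columns i < j."""
    cols = list(zip(*rows))  # transpose: one tuple per column
    norms = [math.sqrt(sum(v * v for v in col)) for col in cols]
    sims = {}
    for i, j in combinations(range(len(cols)), 2):
        dot = sum(a * b for a, b in zip(cols[i], cols[j]))
        denom = norms[i] * norms[j]
        # An all-zero column has no direction; treat its similarity as 0.0.
        sims[(i, j)] = dot / denom if denom else 0.0
    return sims


# Same toy data as in the question: 4 rows, 3 columns.
rows = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
sims = column_cosine_similarities(rows)
# 3 columns give 3 pairs: (0, 1), (0, 2), (1, 2).
```

The number of rows only affects the cost of each dot product, while the number of columns drives the number of output pairs, which is why the approach suits tall and skinny matrices.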

If you wish to perform user-user collaborative filtering (UUCF), clustering would be one way to go. (Among others, locality-sensitive hashing (LSH) is also a good approach.)
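As a rough illustration of the kind of similarity an LSH approach would approximate, here is a minimal plain-Python sketch, assuming each user is represented as a set of "question=choice" tokens and compared with Jaccard similarity. The helper names and the extra `question_2` answers are hypothetical, added only so the example has something to compare. Because each user becomes a set, the order in which questions arrive does not matter.

```python
def to_tokens(answers):
    """Turn {"question_1": "choice_1", ...} into a set of 'question=choice' tokens."""
    return {f"{question}={choice}" for question, choice in answers.items()}


def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical sample data in the shape shown in the question.
users = {
    "user_1": {"question_1": "choice_1", "question_2": "choice_2"},
    "user_2": {"question_3": "choice_4", "question_2": "choice_2"},
    "user_3": {"question_1": "choice_3", "question_2": "choice_2"},
}
tokens = {user: to_tokens(answers) for user, answers in users.items()}


def most_similar(user, tokens):
    """Rank all other users by Jaccard similarity to `user`, best first."""
    scored = [
        (jaccard(tokens[user], other_tokens), other)
        for other, other_tokens in tokens.items()
        if other != user
    ]
    return sorted(scored, reverse=True)


ranking = most_similar("user_1", tokens)
```

This brute-force comparison is quadratic in the number of users, so it only shows the target metric. Spark ML's `MinHashLSH` (in `pyspark.ml.feature`) approximates Jaccard distance over sparse binary vectors and could be one way to scale the same idea, assuming the tokens are first encoded as such vectors.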

eliasah
  • Thanks. Can you please share an example or a link on how to vectorize this dataset into numbers? – Amin Alaee Aug 23 '17 at 07:42
  • You can find what you need here https://stackoverflow.com/questions/44325555/fit-a-dataframe-into-randomforest-pyspark/44326172#44326172 and https://stackoverflow.com/questions/32277576/how-to-handle-categorical-features-with-spark-ml/32278617#32278617 – eliasah Aug 23 '17 at 07:52
  • Sorry for the stupid question, I'm new to data science. I need to generate the features array from my question/answers but the link you've posted has some values for features from before. – Amin Alaee Aug 23 '17 at 08:51
  • I'm not sure I understand your question @MohammadAmin. Would you mind opening a new question with some extra information about what it is you are actually doing? With input data, expected output and what you have tried? – eliasah Aug 23 '17 at 08:52
  • Thanks. The new question is [here](https://stackoverflow.com/questions/45835524/how-to-vectorize-this-json-data) – Amin Alaee Aug 23 '17 at 09:17