
I have been trying to build a beer recommendation engine, and I have decided to keep it simple by using TF-IDF and cosine similarity.

Here is my code so far:

import re

import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer

wnlzer = WordNetLemmatizer()
# Build the stop-word set once instead of once per token.
stop_words = set(stopwords.words('english'))

train = pd.read_csv("labeledTrainData.tsv", header=0,
                    delimiter='\t', quoting=3)


def raw_string_to_list_clean_string(raw_train_review):
    # Strip HTML tags, then drop everything except letters and spaces.
    remove_html = BeautifulSoup(raw_train_review, "html.parser").text
    remove_punch = re.sub('[^A-Za-z ]', "", remove_html)
    token = remove_punch.lower().split()
    # Remove stop words and lemmatize the remaining tokens.
    srm_token = [wnlzer.lemmatize(i) for i in token if i not in stop_words]
    clean_text = " ".join(srm_token)
    return clean_text


ready_train_list = []
length = len(train['review'])
for i in range(0, length):
    if i % 100 == 0:
        print("doing %d of %d of training data set" % (i + 1, length))
    a = raw_string_to_list_clean_string(train['review'][i])
    ready_train_list.append(a)

vectorizer = TfidfVectorizer(analyzer="word", tokenizer=None, preprocessor=None,
                             stop_words=None, max_features=20000)
training_our_vectorizer = vectorizer.fit_transform(ready_train_list)

I know how to compute cosine similarity, but I am not able to figure out:

  1. how to make use of the cosine similarity scores
  2. how to restrict the recommendations to a maximum of 5 beers
Anurag Pandey
  • What do you mean by 'how to use cosine'? You are supposed to use it to find similarity between users or between items. Regarding your second question, the simple answer is 'top-5'. To be more precise, you need to build a list of items to recommend, sorted from best match to worst match, and then present only the top 5 to the user. – Gal Dreiman Aug 07 '16 at 07:19
  • I mean that when I use cosine similarity with one vs. the rest, it gives a very nice matrix. For example, if I use it for the first row, it gives [1, 0.5, 0.23, 0.045, ...]. I understand what this matrix represents, but how do I go about using it? – Anurag Pandey Aug 07 '16 at 07:43
  • Sorry for asking, but do you know anything about collaborative filtering (https://en.wikipedia.org/wiki/Collaborative_filtering)? That link can give you some vital information. Anyway, to answer your question: for a given user, choose the K most similar users (you can extract them from your cosine matrix), then predict, for every item, the rating that user would give it. Then all you have to do is pick the top 5 from that list of predicted ratings (those items are 'probably' the user's favorites). – Gal Dreiman Aug 07 '16 at 08:00
  • I believe `sklearn` already includes the functionality. If you want to understand what it does, perhaps this can help: http://stackoverflow.com/a/27504795/874188 – tripleee Aug 07 '16 at 08:26
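
The user-based approach described in the comments can be sketched roughly as follows. This is only an illustration under stated assumptions: the ratings matrix is made up, `predict_top_items` is a hypothetical helper name, a rating of 0 is taken to mean "not rated", and unrated items from neighbours simply count as 0 in the weighted average, which is a simplification.

```python
import numpy as np


def predict_top_items(ratings, user, k=2, top_n=5):
    """Hypothetical helper: recommend up to top_n unrated items for `user`.

    ratings is a 2-D array (users x items); 0 means "not rated" (assumption).
    """
    ratings = np.asarray(ratings, dtype=float)
    norms = np.linalg.norm(ratings, axis=1)
    norms[norms == 0] = 1.0              # avoid dividing by zero for empty rows
    unit = ratings / norms[:, None]
    sims = unit @ unit[user]             # cosine similarity of `user` to every user
    sims[user] = -np.inf                 # a user is never their own neighbour
    neighbours = np.argsort(-sims)[:k]   # the K most similar users
    weights = sims[neighbours]
    if weights.sum() <= 0:
        return []
    # Similarity-weighted average of the neighbours' ratings for each item.
    predicted = weights @ ratings[neighbours] / weights.sum()
    predicted[ratings[user] > 0] = -np.inf   # only recommend items not yet rated
    order = np.argsort(-predicted, kind="stable")
    return [int(i) for i in order[:top_n] if np.isfinite(predicted[i])]


# Made-up example data: 4 users, 5 items.
example_ratings = [
    [5, 4, 0, 0, 1],
    [5, 5, 4, 0, 1],
    [1, 0, 5, 4, 0],
    [4, 5, 3, 1, 0],
]
result = predict_top_items(example_ratings, user=0, k=2, top_n=5)
print(result)
```

Slicing with `order[:top_n]` is what caps the list at 5; if fewer unrated items exist, fewer are returned.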

1 Answer


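A simple implementation would be to compute the cosine distance from your beer to each of the other beers using SciPy's cdist, and then rank the recommendations using argsort. The first result is the query beer itself (distance 0), so it is skipped: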

from scipy.spatial.distance import cdist
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

vec = TfidfVectorizer()
beerlist = np.array(['heinekin lager', 'corona lager', 'heinekin ale', 'budweiser lager'])
beerlist_tfidf = vec.fit_transform(beerlist).toarray()
beer_tfidf = vec.transform(['heinekin lager']).toarray()
# Cosine distance from the query to every beer; argsort puts the closest first.
rec_idx = cdist(beer_tfidf, beerlist_tfidf, 'cosine').argsort()
# Index 0 is the query itself, so start from index 1.
print(beerlist[rec_idx[0][1:]])

#['heinekin ale' 'corona lager' 'budweiser lager']
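
To cap the result at 5 beers, one option is to slice the ranked list. The sketch below is a variation on the code above, not a definitive implementation: it swaps `cdist` for scikit-learn's `cosine_similarity` (which accepts the sparse TF-IDF matrix directly, so `.toarray()` is not needed), and `recommend` is a hypothetical helper name introduced only for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def recommend(query, beers, top_n=5):
    """Hypothetical helper: return up to top_n beers most similar to query."""
    vec = TfidfVectorizer()
    beers_tfidf = vec.fit_transform(beers)   # sparse matrix, one row per beer
    query_tfidf = vec.transform([query])
    # One similarity score per beer; higher means more similar.
    sims = cosine_similarity(query_tfidf, beers_tfidf).ravel()
    # Sort by descending similarity; a stable sort keeps input order among ties.
    order = np.argsort(-sims, kind="stable")
    # Drop the query itself, then keep at most top_n results.
    ranked = [beers[i] for i in order if beers[i] != query]
    return ranked[:top_n]


beers = ['heinekin lager', 'corona lager', 'heinekin ale', 'budweiser lager']
top = recommend('heinekin lager', beers, top_n=2)
print(top)
```

With the default `top_n=5`, a catalogue smaller than 5 beers simply returns everything except the query.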
maxymoo