
I came across two papers that combine collaborative filtering (matrix factorization) and topic modelling (LDA) to recommend articles or posts to users, based on the topic terms of the articles or posts those users are interested in.

The papers (in PDF) are "Collaborative Topic Modeling for Recommending Scientific Articles" and "Collaborative Topic Modeling for Recommending GitHub Repositories".

The new algorithm is called collaborative topic regression. I was hoping to find some Python code that implements it, but to no avail. This might be a long shot, but can someone show a simple Python example?

smac89
jxn
  • There are several Python packages for topic modelling listed at https://www.cs.princeton.edu/~blei/topicmodeling.html. –  Aug 25 '15 at 23:50
  • In C++, [there is ctr](https://github.com/Blei-Lab/ctr). – kamalbanga Jan 25 '16 at 12:57
  • The repository in kamalbanga's link above implements the first paper you mentioned. Although it is written in C++, you can [call it from Python](http://stackoverflow.com/questions/145270/calling-c-c-from-python). – jtitusj Mar 05 '16 at 06:48
  • Please take a look at the link in the answer below; it has a Python code example, from the scikit-learn.org website, which fits your need exactly. Regards – A. STEFANI Oct 03 '16 at 17:14
  • The best package for this is `gensim`, which you can very easily `pip install`. Here's the topic page: https://radimrehurek.com/gensim/tut2.html. Re. your actual question, looks like... oh no wait I found it. – Eugene Oct 12 '16 at 20:39
  • Did my answer below answer your question? – Eugene Oct 18 '16 at 20:50

2 Answers


This should get you started (although not sure why this hasn't been posted yet): https://github.com/arongdari/python-topic-model

More specifically: https://github.com/arongdari/python-topic-model/blob/master/ptm/collabotm.py

class CollaborativeTopicModel:
    """
    Wang, Chong, and David M. Blei. "Collaborative topic 
                                modeling for recommending scientific articles."
    Proceedings of the 17th ACM SIGKDD international conference on Knowledge
                                discovery and data mining. ACM, 2011.
    Attributes
    ----------
    n_item: int
        number of items
    n_user: int
        number of users
    R: ndarray, shape (n_user, n_item)
        user x item rating matrix
    """

Looks nice and straightforward. I still suggest at least looking at gensim. Radim has done a fantastic job of optimizing that software.
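
To illustrate the idea behind collaborative topic regression, here is a minimal NumPy sketch of the alternating least-squares updates for the user and item vectors described in the Wang and Blei paper. It makes one simplifying assumption: the per-item topic proportions `theta` come from a separately trained LDA model and are held fixed, rather than being learned jointly. The function name `fit_ctr`, the toy data, and all default hyperparameter values are illustrative assumptions and are not taken from the linked repository.

```python
import numpy as np


def fit_ctr(R, theta, lambda_u=0.01, lambda_v=100.0, a=1.0, b=0.01, n_iter=20, seed=0):
    """Sketch of the CTR latent-vector updates with theta held fixed.

    R:     (n_user, n_item) binary matrix, 1 where the user has the item
    theta: (n_item, n_topic) per-item topic proportions, e.g. from LDA
    """
    n_user, n_item = R.shape
    k = theta.shape[1]
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.standard_normal((n_user, k))  # user latent vectors
    V = theta.copy()                            # item vectors start at their topic mix
    C = np.where(R > 0, a, b)                   # confidence: a if observed, b otherwise
    eye = np.eye(k)

    for _ in range(n_iter):
        # u_i = (V^T C_i V + lambda_u I)^-1 V^T C_i R_i
        for i in range(n_user):
            VC = V.T * C[i]  # scale each item column by its confidence c_ij
            U[i] = np.linalg.solve(VC @ V + lambda_u * eye, VC @ R[i])
        # v_j = (U^T C_j U + lambda_v I)^-1 (U^T C_j R_j + lambda_v theta_j)
        for j in range(n_item):
            UC = U.T * C[:, j]  # scale each user column by its confidence c_ij
            V[j] = np.linalg.solve(UC @ U + lambda_v * eye,
                                   UC @ R[:, j] + lambda_v * theta[j])
    return U, V


# Toy data (made up): 3 users, 4 items, 2 topics.
R = np.array([[1, 1, 0, 0],
              [1, 0, 0, 0],
              [0, 0, 1, 1]], dtype=float)
theta = np.array([[0.9, 0.1],
                  [0.8, 0.2],
                  [0.1, 0.9],
                  [0.2, 0.8]])

U, V = fit_ctr(R, theta)
scores = U @ V.T  # predicted preference of every user for every item
```

The `lambda_v * theta[j]` term is what ties the item vector to its topic proportions: a large `lambda_v` keeps `V[j]` close to `theta[j]`, so an item nobody has rated yet can still be scored from its text alone. In this toy setup, the second user only has the first item, and the second item (similar topic mix) should score higher for them than the other two.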

Eugene

A very simple LDA implementation using gensim. You can find more information here: https://radimrehurek.com/gensim/tutorial.html

I hope this helps.

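# Requires the NLTK data for the tokenizer, the stopword list and the
# RSLP stemmer (install them with nltk.download).
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import RSLPStemmer
from gensim import corpora, models
import gensim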

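# Note: RSLPStemmer is NLTK's Portuguese stemmer; for English text,
# PorterStemmer or SnowballStemmer('english') is the usual choice.
st = RSLPStemmer()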
texts = []

doc1 = "Veganism is both the practice of abstaining from the use of animal products, particularly in diet, and an associated philosophy that rejects the commodity status of animals"
doc2 = "A follower of either the diet or the philosophy is known as a vegan."
doc3 = "Distinctions are sometimes made between several categories of veganism."
doc4 = "Dietary vegans refrain from ingesting animal products. This means avoiding not only meat but also egg and dairy products and other animal-derived foodstuffs."
doc5 = "Some dietary vegans choose to wear clothing that includes animal products (for example, leather or wool)." 

docs = [doc1, doc2, doc3, doc4, doc5]

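stop_words = set(stopwords.words('english'))  # build the stopword set once

for doc in docs:
    # lowercase and tokenize, drop stopwords, then stem each remaining token
    tokens = word_tokenize(doc.lower())
    stopped_tokens = [w for w in tokens if w not in stop_words]
    stemmed_tokens = [st.stem(w) for w in stopped_tokens]
    texts.append(stemmed_tokens)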

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# generate LDA model using gensim
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=20)
print(ldamodel.print_topics(num_topics=2, num_words=4))

[(0, u'0.066*animal + 0.065*, + 0.047*product + 0.028*philosophy'), (1, u'0.085*. + 0.047*product + 0.028*dietary + 0.028*veg')]
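
The topics printed above include `,` and `.` among their top terms, because `word_tokenize` keeps punctuation marks as separate tokens and nothing filters them out afterwards. Below is a minimal sketch of one way to drop them before stemming, using only the standard library. `tokenize_words` is a hypothetical helper name, not part of the code above; NLTK's `RegexpTokenizer(r'\w+')` should behave similarly.

```python
import re


def tokenize_words(text):
    """Hypothetical helper: lowercase the text and keep only runs of word
    characters, so punctuation such as ',' and '.' never becomes a token."""
    return re.findall(r"\w+", text.lower())


tokens = tokenize_words("A follower of either the diet or the philosophy is known as a vegan.")
```

Swapping this in for `word_tokenize(doc.lower())` would change the fitted topics, so the printed output above would no longer match exactly.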