
I have a pandas dataframe of organisation descriptions and project titles.

The columns are df['org_name'], df['org_description'], and df['proj_title']. I want to add a column with the similarity score between the organisation description and the project title, for each project (each row).

I'm trying to use gensim: https://radimrehurek.com/gensim/auto_examples/core/run_similarity_queries.html. However, I'm not sure how to adapt the tutorial to my use case, because in the tutorial a new query doc = "Human computer interaction" is compared against each of the documents in the corpus individually. I'm not sure where that choice is made (in sims? in vec_lsi?).

But I want the similarity score between just the two items in a given row of dataframe df, not one item against the whole corpus, and then to append those scores to df as a column. How can I do this?

Mobeus Zoom
  • The tutorial attached is for querying against a corpus (a collection of texts) using LSI (Latent Semantic Indexing). If you want to perform doc-doc similarity, there are more appropriate algorithms for that. – thorntonc Aug 07 '20 at 17:59
  • @thorntonc feel free to update/replace your answer with a different algorithm if it'd be better. For example I found this: https://stackoverflow.com/questions/22433884/python-gensim-how-to-calculate-document-similarity-using-the-lda-model. Could be that all that's necessary is some way of applying the functions here? (e.g. see post by 'eng.mrgh') – Mobeus Zoom Aug 07 '20 at 18:52

1 Answer


Here is an adaptation of the Gensim LSI tutorial, where the description is treated as a corpus of sentences and the title is the query made against it.

from gensim.models import LsiModel
from collections import defaultdict
from gensim import corpora

def desc_title_sim(desc, title):
    # remove common words and tokenize
    stoplist = set('for a of the and to in'.split())  # add a longer stoplist here
    sents = desc.split('.')  # crude sentence tokenizer
    texts = [
        [word for word in sent.lower().split() if word not in stoplist]
        for sent in sents
    ]

    # remove words that appear only once
    frequency = defaultdict(int)
    for text in texts:
        for token in text:
            frequency[token] += 1

    texts = [
        [token for token in text if frequency[token] > 1]
        for text in texts
    ]

    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(text) for text in texts]

    lsi = LsiModel(corpus, id2word=dictionary, num_topics=2)

    vec_bow = dictionary.doc2bow(title.lower().split())
    vec_lsi = lsi[vec_bow]  # convert the query to LSI space
    return vec_lsi

Apply the function row-wise to get similarity:

df['sim'] = df.apply(lambda row: desc_title_sim(row['org_description'], row['proj_title']), axis=1)

The newly created sim column will be populated with the title's coordinates in LSI space (one (topic, weight) pair per topic), not yet a single similarity score, e.g.

[(0, 0.4618210045327158), (1, 0.07002766527900064)]
thorntonc
  • thanks! However, I want the corpus to remain constant (I think): it would represent the full set of descriptions and titles (without tokenizing into sentences). So for `documents` I'm taking all the elements of `df['org_description']` and `df['proj_title']`. I would use this as the basis of the vector space and get the similarity between a given description and title using the vectors for each. – Mobeus Zoom Aug 07 '20 at 18:33
  • isn't this algorithm the right way to do it? If not, how would you do it? (That's to say, I want doc-doc similarity, but to compare similarity scores to one another, the doc-doc similarity has to be computed without changing the axes every time, right? So `documents` has to stay the same. Also, there isn't really enough text/info within one doc to constitute a corpus for comparison.) – Mobeus Zoom Aug 07 '20 at 18:35
  • @Mobeus Zoom those are some good points. I'm not seeing any Gensim documentation/examples that work with those parameters; I'd try searching deeper into the documentation to modify my answer, or look for alternative methods for doc-doc similarity. – thorntonc Aug 07 '20 at 18:51
  • yes, please feel free to suggest alternatives. The link I pointed to above (https://stackoverflow.com/questions/22433884/python-gensim-how-to-calculate-document-similarity-using-the-lda-model) could be promising? But I'm not yet familiar with Gensim at all, as you can probably tell – Mobeus Zoom Aug 07 '20 at 18:53
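For reference, the doc-doc similarity discussed in these comments can also be computed with a fixed vocabulary and no per-row corpus rebuilding at all, e.g. plain bag-of-words cosine similarity. A minimal sketch using only the standard library (the whitespace tokenization here is deliberately crude; `cosine_sim` is a hypothetical helper, not a Gensim function):

```python
import math
from collections import Counter

def cosine_sim(doc_a, doc_b):
    """Cosine similarity between two texts, using raw term counts."""
    a = Counter(doc_a.lower().split())
    b = Counter(doc_b.lower().split())
    # dot product over the words the two documents share
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0
```

Applied row-wise as `df.apply(lambda r: cosine_sim(r['org_description'], r['proj_title']), axis=1)`, this gives comparable scores across rows precisely because the representation never changes between rows.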