
I am currently working on a small application in Python that has search functionality (currently using difflib), but I want to build semantic search that returns the top 5 or 10 results from my database based on user-inputted text, the same way the Google search engine works. I found some solutions here.

But the problem is that one of those solutions treats the two statements below as semantically different. I don't care about that distinction, because it makes things harder than I need. I would also prefer the solution to be a pretrained neural network model or a library that I can implement easily.

  • Pete and Rob have found a dog near the station.
  • Pete and Rob have never found a dog near the station.

I also found some solutions that use gensim and GloVe embeddings, but they find similarity between words, not sentences.

What I want

Suppose my db has the statement display classes, and user inputs such as show, showed, displayed, displayed class, show types etc. should all match it. And if the above 2 statements are treated as the same, I also don't care. displayed and displayed class already match in difflib.
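For reference, this is roughly what the current difflib baseline looks like (a minimal sketch; the statement list is illustrative, not my actual db):

import difflib

# Hypothetical fixed set of db statements.
statements = ["display classes", "list students", "add marks"]

# Character-based matching handles close spellings like "displayed class"...
print(difflib.get_close_matches("displayed class", statements, n=5))
# ['display classes']

# ...but misses semantic rephrasings like "show types".
print(difflib.get_close_matches("show types", statements, n=5))
# []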

Points to be noted

  • Matches come from a fixed set of statements, but the user-inputted statement can differ
  • Must work for whole statements, not just single words

3 Answers


I think that is not a gensim embedding, it is a word2vec embedding. Whatever it is, what you need is tensorflow_hub.

The Universal Sentence Encoder encodes text into high-dimensional vectors that can be used for text classification, semantic similarity, clustering and other natural language tasks.

I believe what you need here is Text Classification or Semantic Similarity, because you want to find the top 5 or 10 statements nearest to a given user statement.

It is easy to use, but the model is ≈ 1 GB in size. It works with words, sentences, phrases, or short paragraphs. The input is variable-length English text and the output is a 512-dimensional vector. You can find more information about it here.

Code

import tensorflow_hub as hub
import numpy as np

# Load the model. It will be downloaded on first use.
module_url = "https://tfhub.dev/google/universal-sentence-encoder-large/5"
model = hub.load(module_url)

# data[0] is your actual db statement; the rest are user inputs to compare.
data = ["display classes", "show", "showed", "displayed class", "show types"]

# Encode every statement into a 512-dimensional vector.
vecs = model(data)

# Similarity between the first statement and all statements via the inner
# product. The vectors are approximately unit length, so this is close to
# cosine similarity: higher means more similar.
sims = np.inner(vecs[0], vecs)

print(sims)

Output

array([0.9999999 , 0.5633253 , 0.46475542, 0.85303843, 0.61701006], dtype=float32)

Conclusion

The first value, 0.9999999, is the similarity between display classes and itself; the second, 0.5633253, is the similarity between display classes and show; and the last, 0.61701006, is the similarity between display classes and show types. Note these are similarities, not distances: higher means more similar.

Using this, you can compute the similarity between a given input and every statement in your db, then rank the statements by similarity.
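For example, a minimal sketch of that ranking step (the db statements and query here are illustrative, not from the question):

import tensorflow_hub as hub
import numpy as np

model = hub.load("https://tfhub.dev/google/universal-sentence-encoder-large/5")

# Hypothetical db statements and user input, for illustration only.
db_statements = ["display classes", "list students", "add marks", "show types"]
query = "show me the classes"

# Encode the query together with the db statements; row 0 is the query.
vecs = model([query] + db_statements)

# Similarity of the query to each db statement.
sims = np.inner(vecs[0], vecs[1:])

# Print the top 5 statements, most similar first.
for i in np.argsort(-sims)[:5]:
    print(db_statements[i], sims[i])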


You can use WordNet to find synonyms, and then use those synonyms to find similar statements.

import nltk
from nltk.corpus import wordnet as wn

nltk.download('wordnet')

def get_syn_list(gword):
    # Collect synsets of the word across the four main parts of speech.
    syn_list = []
    for pos in (wn.NOUN, wn.VERB, wn.ADJ, wn.ADV):
        syn_list.extend(wn.synsets(gword, pos=pos))
    # Keep one representative lemma name per synset.
    return [syn.lemmas()[0].name() for syn in syn_list]

Now split each statement in your db into words and collect the synonyms of every word, like this:

stat = ["display classes"]

syn_dict = {}
for i in stat:
   tmp = []
   for x in i.split(" "):
       tmp.extend(get_syn_list(x))
   syn_dict[i] = set(tmp)

Now that you have the synonyms, just compare them with the inputted text. Use a lemmatizer before comparing words, so that displayed becomes display.
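A minimal sketch of that comparison step, assuming the syn_dict built above and NLTK's WordNetLemmatizer; the overlap-count scoring is illustrative, not part of the original answer:

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def rank_statements(user_input, syn_dict):
    # Lemmatize each input word as a verb and as a noun,
    # so that "displayed" reduces to "display".
    lemmas = set()
    for word in user_input.lower().split():
        lemmas.add(lemmatizer.lemmatize(word, pos="v"))
        lemmas.add(lemmatizer.lemmatize(word, pos="n"))
    # Score each db statement by how many lemmas appear in its synonym set.
    scores = {s: len(lemmas & syns) for s, syns in syn_dict.items()}
    return sorted(scores, key=scores.get, reverse=True)

print(rank_statements("displayed class", syn_dict))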


Hey, you can use spaCy.

This answer is from https://medium.com/better-programming/the-beginners-guide-to-similarity-matching-using-spacy-782fc2922f7c

import spacy

# en_core_web_lg ships with word vectors, which similarity() relies on.
nlp = spacy.load("en_core_web_lg")

doc1 = nlp("display classes")
doc2 = nlp("show types")
print(doc1.similarity(doc2))

Output

0.6277548513279427

Edit

First run the following command, which will download the model.

!python -m spacy download en_core_web_lg
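
To get the top 5 results the question asks for, you could rank the db statements by similarity to the user input. A minimal sketch, with an illustrative statement list:

import spacy

nlp = spacy.load("en_core_web_lg")

# Hypothetical db statements, for illustration only.
statements = ["display classes", "list students", "add marks"]
query = nlp("show types")

# Sort statements by similarity to the query, most similar first.
ranked = sorted(statements, key=lambda s: query.similarity(nlp(s)), reverse=True)
print(ranked[:5])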
  • Hey When I installed spacy and run above code I got error `Can't find model 'en_core_web_lg'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.` – FocusNow Jun 01 '20 at 13:02
  • While this code may resolve the OP's issue, it is best to include an explanation as to how your code addresses the OP's issue. In this way, future visitors can learn from your post, and apply it to their own code. SO is not a coding service, but a resource for knowledge. Also, high quality, complete answers are more likely to be upvoted. These features, along with the requirement that all posts are self-contained, are some of the strengths of SO as a platform, that differentiates it from forums. You can edit to add additional info &/or to supplement your explanations with source documentation. – SherylHohman Jun 01 '20 at 22:58
  • Keep in mind, that while links are welcome, all posts on SO must be self-contained, in case the content of the page changes, or the link becomes unavailable. Also user experience suffers when users are *required* to hop around the web to piece together an "answer". Simply include any relevant explanations from the link into your post. In this way, UX is great, and links provide a welcome addendum for follow up, confirmation, or deeper dives. – SherylHohman Jun 01 '20 at 23:03
  • @FocusNow I updated solution. Your issue is solved. – PP-56 Jun 02 '20 at 03:29