
I'm trying to use google flan-t5-large to create embeddings for a simple semantic search engine. However, the cosine similarity between the generated embeddings and my query embedding is way off. Is there something I'm doing wrong?

import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.metrics.pairwise import cosine_similarity
from scipy.spatial.distance import euclidean

tokenizer = AutoTokenizer.from_pretrained('google/flan-t5-large')
model = AutoModel.from_pretrained('google/flan-t5-large')

def generate_embeddings(texts):
  # Mean-pool the decoder's last hidden state to get one vector per text
  all_embeddings = []
  for text in texts:
    input_ids = tokenizer.encode(text, return_tensors='pt')
    with torch.no_grad():
      embeddings = model(input_ids, decoder_input_ids=input_ids).last_hidden_state.mean(dim=1)
    all_embeddings.append((embeddings, text))
  return all_embeddings

def run_query(query, corpus):
  input_ids = tokenizer.encode(query, return_tensors='pt')
  with torch.no_grad():
    query_embedding = model(input_ids, decoder_input_ids=input_ids).last_hidden_state.mean(dim=1)

  similarity = []
  for embedding, text in corpus:
    # Distance between the query embedding and each corpus embedding
    sim = euclidean(embedding.flatten(), query_embedding.flatten())
    similarity.append((text, float(sim)))
  return similarity


# Texts to encode
texts = ['some sad song', 'a very happy song']
corpus = generate_embeddings(texts)

query = "I'm feeling so sad rn"
similarity = run_query(query, corpus)
for text, score in similarity:
  print(text, score)

I've tried different pooling techniques as well as using other distance metrics.

  • Maybe try something simpler? e.g. https://github.com/alvations/cliffjumper – alvas Mar 08 '23 at 19:05
  • @alvas the problem is that my corpus does not consist of sentences. It actually has large multi-paragraph documents. A sentence transformer is performing poorly – Affan Mir Mar 09 '23 at 09:16

1 Answer


The problem you face here is that you assume FLAN's sentence embeddings are suited for similarity metrics, but that isn't the case. Jacob Devlin once wrote regarding BERT:

I'm not sure what these vectors are, since BERT does not generate meaningful sentence vectors.

But that isn't an issue, because FLAN is intended for other use cases. It was trained on different datasets, each with an instruction prompt suited to the task, to allow zero-shot prompting (i.e. performing tasks the model hasn't been trained on). That means you can perform your similarity task by formulating a proper prompt, without any training. For example:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "google/flan-t5-large"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

prompt = """Which song fits the query.
QUERY: I'm feeling so sad rn 
OPTIONS 
-some sad song 
-a very happy song"""

input_ids = tokenizer(prompt, return_tensors="pt").input_ids  
outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Output:

some sad song

Depending on your use case, you might run into issues as the number of options grows, or when you actually need to work with sentence embeddings. In that case, have a look at sentence-transformers. These are transformers that were trained to produce meaningful sentence embeddings and can therefore be used to calculate the cosine similarity of two sentences.
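
As a rough sketch of that route (assuming the sentence-transformers package is installed; all-MiniLM-L6-v2 is just an example checkpoint, any other sentence-transformers model works the same way):

from sentence_transformers import SentenceTransformer, util

# Example checkpoint; any sentence-transformers model can be substituted
st_model = SentenceTransformer('all-MiniLM-L6-v2')

corpus = ['some sad song', 'a very happy song']
query = "I'm feeling so sad rn"

# encode() returns one embedding per input sentence
corpus_embeddings = st_model.encode(corpus, convert_to_tensor=True)
query_embedding = st_model.encode(query, convert_to_tensor=True)

# Cosine similarity between the query and every corpus entry
scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
for sentence, score in zip(corpus, scores):
    print(sentence, float(score))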

cronoik
  • Thank you, that makes sense! Would it be possible to train an LLM with a similarity metric as the loss function, and could the embeddings learnt then be used for semantic search? – Affan Mir Mar 09 '23 at 09:07
  • Yes, that's possible. Just check out the sentence-transformers training section if you want to look deeper into it: https://www.sbert.net/docs/training/overview.html – cronoik Mar 09 '23 at 17:38
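
A minimal sketch of such a training setup, following the model.fit API from the sentence-transformers training overview linked above; the training pairs and similarity labels here are made up purely for illustration:

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Example starting checkpoint; any sentence-transformers model can be fine-tuned
model = SentenceTransformer('all-MiniLM-L6-v2')

# Hypothetical (query, document) pairs with similarity labels in [0, 1]
train_examples = [
    InputExample(texts=["I'm feeling so sad rn", 'some sad song'], label=0.9),
    InputExample(texts=["I'm feeling so sad rn", 'a very happy song'], label=0.1),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

# CosineSimilarityLoss trains the embeddings so that their cosine similarity
# matches the provided labels
train_loss = losses.CosineSimilarityLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)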