I have an NLP classification problem in which I want to match an input string (a question) to the most suitable string from a list of reference strings (FAQs), or abstain if confidence in the classification is low.
I have an existing function that uses distilbert-base-uncased embeddings and cosine similarity, and it performs OK. However, the similarity scores are typically high for all reference strings, which is a consequence of them all being semantically similar. The strings themselves are all on a particular topic (e.g., "What is X?", "How can I prevent X?", "What are the symptoms of X?", "How can I tell if X is happening?"), so this isn't exactly surprising.
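To illustrate what I mean, the following sketch computes pairwise cosine similarities between a few FAQ-style strings using the same distilbert-base-uncased [CLS] embeddings. The strings are just the placeholders from above, so the exact numbers don't mean much; it's the general pattern of uniformly high scores between all the strings that I'm describing:

from transformers import AutoTokenizer, AutoModel
from sklearn.metrics.pairwise import cosine_similarity
import torch

# Placeholder FAQ strings from the example above
faqs = [
    "What is X?",
    "How can I prevent X?",
    "What are the symptoms of X?",
    "How can I tell if X is happening?",
]

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')
model = AutoModel.from_pretrained('distilbert-base-uncased')

# Batch-encode with padding, then take the [CLS] token embedding for each string
encoded = tokenizer(faqs, padding=True, return_tensors='pt')
with torch.no_grad():
    embeddings = model(**encoded).last_hidden_state[:, 0, :]

# Pairwise cosine similarities between the FAQ strings themselves
print(cosine_similarity(embeddings, embeddings))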
What techniques can I use to improve performance here? I do not have any training data, so fine-tuning is out. I can obviously try different language models and similarity measures, but it's difficult to determine whether this is going to have any noticeable impact.
Are there any statistical or additional NLP techniques people can recommend for this problem?
My existing function is as follows:
from transformers import AutoTokenizer, AutoModel
import torch
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
def mapping(user_input: str, abstain_threshold: float,
            language_model='distilbert-base-uncased', faqs_file='faqs.txt'):
    # Load the pre-trained transformer model and tokenizer
    tokenizer = AutoTokenizer.from_pretrained(language_model)
    model = AutoModel.from_pretrained(language_model)

    # Load the FAQ list (one FAQ per line)
    with open(faqs_file, 'r') as f:
        faqs = [line.strip() for line in f]

    # Tokenize the user input and FAQs
    user_input_tokens = tokenizer.encode(user_input, add_special_tokens=True)
    faq_tokens = [tokenizer.encode(faq, add_special_tokens=True) for faq in faqs]

    # Pad the tokenized sequences to the same length
    max_len = max(len(tokens) for tokens in faq_tokens + [user_input_tokens])
    user_input_tokens = user_input_tokens + [0] * (max_len - len(user_input_tokens))
    faq_tokens = [tokens + [0] * (max_len - len(tokens)) for tokens in faq_tokens]

    # Convert the tokenized sequences to PyTorch tensors
    user_input_tensor = torch.tensor(user_input_tokens).unsqueeze(0)
    faq_tensors = [torch.tensor(tokens).unsqueeze(0) for tokens in faq_tokens]

    # Pass the user input and FAQs through the model and take the [CLS] embedding
    with torch.no_grad():
        user_input_embedding = model(user_input_tensor)[0][:, 0, :]
        faq_transformer_embeddings = [model(faq_tensor)[0][:, 0, :] for faq_tensor in faq_tensors]

    # Use cosine similarity to score each FAQ against the user input
    faq_similarity_scores = []
    for faq_transformer_embedding in faq_transformer_embeddings:
        similarity = cosine_similarity(user_input_embedding, faq_transformer_embedding)[0][0]
        print(similarity)
        faq_similarity_scores.append(similarity)

    # Find the most similar FAQ
    max_score_index = np.argmax(faq_similarity_scores)
    max_score = faq_similarity_scores[max_score_index]
    best_match = faqs[max_score_index]

    # Abstain (return None) if even the best score is below the threshold
    if max_score >= abstain_threshold:
        return best_match
    else:
        return None
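For completeness, I call it roughly like this (the query string and the 0.9 threshold are only illustrative; faqs.txt contains one FAQ per line):

# Example call; query text and threshold are placeholders
match = mapping("How do I know whether X is happening?", abstain_threshold=0.9)
if match is None:
    print("Low confidence, abstaining.")
else:
    print("Best match:", match)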