I have an NLP classification problem in which I want to match an input string (a question) to the most suitable string from a list of reference strings (FAQs), or abstain if confidence in the classification is low.
I have an existing function that uses distilbert-base-uncased embeddings and cosine similarity, and it performs OK. However, the similarity scores are typically high for all reference strings, which is a consequence of them all being semantically similar. The strings themselves are all on a particular topic (e.g., "What is X?", "How can I prevent X?", "What are the symptoms of X?", "How can I tell if X is happening?"), so this isn't exactly surprising.
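To illustrate what I mean, the following sketch computes pairwise cosine similarities between a few FAQ-style strings using the same distilbert-base-uncased [CLS] embeddings. The strings are just the placeholders from above, so the exact numbers don't mean much; it's the general pattern of uniformly high scores between all the strings that I'm describing:

from transformers import AutoTokenizer, AutoModel
from sklearn.metrics.pairwise import cosine_similarity
import torch

# Placeholder FAQ strings from the example above
faqs = [
    "What is X?",
    "How can I prevent X?",
    "What are the symptoms of X?",
    "How can I tell if X is happening?",
]

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')
model = AutoModel.from_pretrained('distilbert-base-uncased')

# Batch-encode with padding, then take the [CLS] token embedding for each string
encoded = tokenizer(faqs, padding=True, return_tensors='pt')
with torch.no_grad():
    embeddings = model(**encoded).last_hidden_state[:, 0, :]

# Pairwise cosine similarities between the FAQ strings themselves
print(cosine_similarity(embeddings, embeddings))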
What techniques can I use to improve performance here? I do not have any training data, so fine-tuning is out. I can obviously try different language models and similarity measures, but it's difficult to determine whether this is going to have any noticeable impact.
Are there any statistical or additional NLP techniques people can recommend for this problem?
My existing function is as follows:
from transformers import AutoTokenizer, AutoModel
import torch
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
def mapping(user_input: str, abstain_threshold: float,
            language_model='distilbert-base-uncased', faqs_file='faqs.txt'):
    # Load the pre-trained transformer model and tokenizer
    tokenizer = AutoTokenizer.from_pretrained(language_model)
    model = AutoModel.from_pretrained(language_model)

    # Load the FAQ list (one FAQ per line)
    with open(faqs_file, 'r') as f:
        faqs = [line.strip() for line in f]

    # Tokenize the user input and FAQs
    user_input_tokens = tokenizer.encode(user_input, add_special_tokens=True)
    faq_tokens = [tokenizer.encode(faq, add_special_tokens=True) for faq in faqs]

    # Pad the tokenized sequences to the same length
    max_len = max(len(tokens) for tokens in faq_tokens + [user_input_tokens])
    user_input_tokens = user_input_tokens + [0] * (max_len - len(user_input_tokens))
    faq_tokens = [tokens + [0] * (max_len - len(tokens)) for tokens in faq_tokens]

    # Convert the tokenized sequences to PyTorch tensors
    user_input_tensor = torch.tensor(user_input_tokens).unsqueeze(0)
    faq_tensors = [torch.tensor(tokens).unsqueeze(0) for tokens in faq_tokens]

    # Pass the user input and FAQs through the model and take the [CLS] embedding
    with torch.no_grad():
        user_input_embedding = model(user_input_tensor)[0][:, 0, :]
        faq_transformer_embeddings = [model(faq_tensor)[0][:, 0, :] for faq_tensor in faq_tensors]

    # Use cosine similarity to score each FAQ against the user input
    faq_similarity_scores = []
    for faq_transformer_embedding in faq_transformer_embeddings:
        similarity = cosine_similarity(user_input_embedding, faq_transformer_embedding)[0][0]
        print(similarity)
        faq_similarity_scores.append(similarity)

    # Find the most similar FAQ
    max_score_index = np.argmax(faq_similarity_scores)
    max_score = faq_similarity_scores[max_score_index]
    best_match = faqs[max_score_index]

    # Abstain (return None) if even the best score is below the threshold
    if max_score >= abstain_threshold:
        return best_match
    else:
        return None
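For completeness, I call it roughly like this (the query string and the 0.9 threshold are only illustrative; faqs.txt contains one FAQ per line):

# Example call; query text and threshold are placeholders
match = mapping("How do I know whether X is happening?", abstain_threshold=0.9)
if match is None:
    print("Low confidence, abstaining.")
else:
    print("Best match:", match)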