I'm trying to increase the efficiency of a non-conformity management program. I have a database of a few hundred rows, each describing a non-conformity in a free-text field. The text is in Italian and I have no control over what the user writes. I'm trying to write a Python program using NLTK to detect how many of these rows report the same problem, written differently but with similar content. For example, the following sentences should be recognized as related, with a high degree of confidence:

  • I received 10 pieces less than what was ordered
  • 10 pieces have not been shipped

I already found the following article describing how to preprocess text for analysis: How to Develop a Paraphrasing Tool Using NLP (Natural Language Processing) Model in Python

I also found other questions on SO, but they all refer to word similarity, comparison of two sentences, or comparison against a reference meaning.

In my case, I have no reference and multiple sentences that need to be grouped if they refer to similar problems, so I wonder whether this job is even possible to do with a script.

This answer says that it cannot be done, but it's quite old and maybe someone knows something new.

Thanks to everyone who can help me.

2 Answers


Thanks to Anurag Wagh's advice I figured it out. I used this tutorial about gensim and the many ways it can be used.

Chapter 18 does what I was asking for, but during my tests I found a better way to achieve my goal.

Chapter 11 shows how to build an LDA model and how to extract a list of main topics from a set of documents.

Here is the code I used to build the LDA model:

# Step 0: Import packages and stopwords
# (nltk.download('punkt') and nltk.download('stopwords') may be needed on first use)
from gensim.models import LdaMulticore
from gensim import corpora
import nltk
import string
import pattern.it  # plain `import pattern` does not expose the `it` submodule
import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s')
logging.root.setLevel(level=logging.INFO)

# one non-conformity description per line
with open('file.txt', encoding='utf-8') as f:
    docs = [line.strip() for line in f]

# list of Italian stop words
it_stop_words = nltk.corpus.stopwords.words('italian')
it_stop_words = it_stop_words + [<custom stop words>]
# Snowball stemmer with rules for the Italian language
# (defined for reference; the pipeline below does not stem)
ita_stemmer = nltk.stem.snowball.ItalianStemmer()

# helper that returns the lemma of the input word using
# pattern's Italian parser (not called in the loop below,
# but it can be applied per token after stop-word removal)
def lemmatize_word(input_word):
    word_it = pattern.it.parse(
        input_word,
        tokenize=False,
        tag=False,
        chunk=False,
        lemmata=True
    )
    # parse() returns a tagged string; field 4 of the first
    # token holds the lemma
    the_lemmatized_word = word_it.split()[0][0][4]
    return the_lemmatized_word

# Step 2: Prepare data (tokenize, lowercase, drop punctuation and stop words)
data_processed = []

for doc in docs:
    word_tokenized_list = nltk.tokenize.word_tokenize(doc)
    word_tokenized_no_punct = [x.lower() for x in word_tokenized_list if x not in string.punctuation]
    word_tokenized_no_punct_no_sw = [x for x in word_tokenized_no_punct if x not in it_stop_words]
    # split tokens such as "dell'ordine" on the apostrophe and flatten the result
    word_tokenized_no_punct_no_sw_no_apostrophe = [x.split("'") for x in word_tokenized_no_punct_no_sw]
    word_tokenized_no_punct_no_sw_no_apostrophe = [y for x in word_tokenized_no_punct_no_sw_no_apostrophe for y in x]
    data_processed.append(word_tokenized_no_punct_no_sw_no_apostrophe)

# Step 3: Build the dictionary and the bag-of-words corpus
dct = corpora.Dictionary(data_processed)
corpus = [dct.doc2bow(line) for line in data_processed]

# Step 4: Train the LDA model
lda_model = LdaMulticore(corpus=corpus,
                         id2word=dct,
                         random_state=100,
                         num_topics=7,
                         passes=10,
                         chunksize=1000,
                         batch=False,
                         alpha='asymmetric',
                         decay=0.5,
                         offset=64,
                         eta=None,
                         eval_every=0,
                         iterations=100,
                         gamma_threshold=0.001,
                         per_word_topics=True)

# save the model
lda_model.save('lda_model.model')
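# (note) the saved model can be reloaded later with:
# lda_model = LdaMulticore.load('lda_model.model')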

# See the topics
lda_model.print_topics(-1)

With the model trained, I can get a list of topics for each new non-conformity and detect whether it is related to something already reported by other non-conformities.
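For reference, a minimal sketch of that last step: the new record goes through the same preprocessing as the training data, and the new_doc text is just an invented example.

# preprocess a new non-conformity exactly like the training data
new_doc = "10 pezzi non sono stati spediti"  # invented example
tokens = nltk.tokenize.word_tokenize(new_doc)
tokens = [x.lower() for x in tokens if x not in string.punctuation]
tokens = [x for x in tokens if x not in it_stop_words]
tokens = [y for x in tokens for y in x.split("'")]

# map it onto the trained dictionary and query the model
bow = dct.doc2bow(tokens)
topics = lda_model.get_document_topics(bow)  # list of (topic_id, probability)
print(sorted(topics, key=lambda t: -t[1]))   # most likely topic first

Two records whose dominant topic matches can then be flagged as candidates for the same underlying problem.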


Perhaps converting the documents to vectors and then computing the distance between two vectors would be helpful.

doc2vec can be helpful here.
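A minimal sketch of that idea with gensim's Doc2Vec; the parameters are illustrative, not tuned, and this assumes gensim 4.x, where document vectors are accessed through model.dv (older versions used model.docvecs):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# reuse the tokenized documents prepared in the other answer
tagged = [TaggedDocument(words=tokens, tags=[i])
          for i, tokens in enumerate(data_processed)]

d2v = Doc2Vec(tagged, vector_size=50, min_count=2, epochs=40)

# find the stored documents closest to document 0
vec = d2v.infer_vector(data_processed[0])
print(d2v.dv.most_similar([vec], topn=5))  # (doc_id, cosine similarity) pairs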

Anurag Wagh