How to find issues similar to a new, unseen issue based on past issues (each with a summary and description) using natural language processing in Python

1 Answer

If I understand you correctly, you have a new issue (the query) and you want to look up similar issues (the documents) in your database. If so, what you need is a way to measure the similarity between your query and each existing document. Once you have those scores, you can rank the documents and select the most relevant ones. One method that allows you to do this is Latent Semantic Indexing (LSI).

To do this you'll have to construct a document-term matrix from your existing documents: one row per document and one column per term, where each cell records how many times that term appears in that document. The cells can hold raw counts (a bag-of-words representation) or a weighted measure such as TF-IDF. LSI additionally reduces this matrix to a small number of latent dimensions with a truncated singular value decomposition (SVD); the ranking steps are the same with or without that reduction.
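
To make the document-term matrix concrete, here is a minimal sketch in plain Python. The sample issues and the whitespace tokenizer are assumptions made up for illustration; a real pipeline would use proper tokenization and your own issue text.

```python
from collections import Counter

# Hypothetical past issues (summary and description joined into one string).
documents = [
    "login page crashes on submit",
    "crash when submitting login form",
    "report export is slow",
]


def tokenize(text):
    # Deliberately naive: lowercase and split on whitespace.
    return text.lower().split()


# Vocabulary: every distinct term across all documents, in a fixed order.
vocabulary = sorted({term for doc in documents for term in tokenize(doc)})

# One row per document, one column per vocabulary term, cell = raw count.
doc_term_matrix = []
for doc in documents:
    counts = Counter(tokenize(doc))
    doc_term_matrix.append([counts[term] for term in vocabulary])

for doc, row in zip(documents, doc_term_matrix):
    print(row, doc)
```

Every row has the same length as the vocabulary, which is what later makes the documents comparable with each other and with a query.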

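Once you have that, you'll have to process your query so that it is in the same form as your documents: apply the same preprocessing and map it onto the same vocabulary, so that the query vector has the same number of columns as the document-term matrix. With the query in usable form, you can calculate the cosine similarity between the query and each document. The document with the highest cosine similarity is the closest match.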
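
The query and ranking steps could look roughly like this with scikit-learn. This is a sketch, not the only way to do it: the sample issues are invented, and `TfidfVectorizer` is used with its default settings, which may not suit your data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical past issues and a new, unseen issue.
past_issues = [
    "login page crashes on submit",
    "password reset email never arrives",
    "report export is slow for large projects",
]
new_issue = "app crashes after submitting the login page"

# Learn the vocabulary and IDF weights from the past issues only.
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(past_issues)

# Map the query onto the SAME vocabulary; unknown words are ignored.
query_vector = vectorizer.transform([new_issue])

# One similarity score per past issue, then rank from highest to lowest.
scores = cosine_similarity(query_vector, doc_vectors)[0]
ranking = scores.argsort()[::-1]

for idx in ranking:
    print(f"{scores[idx]:.3f}  {past_issues[idx]}")
```

Words in the query that never occurred in the past issues (here "app", "after", "submitting", "the") simply contribute nothing to the query vector.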

Note: The broader topic that you may want to read about is Information Retrieval. LSI is just one method from that field, so you should look into other methods as well.
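
For the LSI variant specifically, one common approach in scikit-learn is to follow the TF-IDF step with `TruncatedSVD`. The sketch below assumes that approach; the sample issues and the choice of `n_components=2` are illustrative only, and a real corpus would typically use far more components.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.pipeline import make_pipeline

# Hypothetical past issues.
past_issues = [
    "login page crashes on submit",
    "crash when submitting the login form",
    "report export is slow for large projects",
    "exporting a large report takes minutes",
]

# TF-IDF followed by a truncated SVD, i.e. an LSI-style projection.
lsi = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=2, random_state=0),
)

# Fit on the past issues, then project the new issue into the same space.
doc_topics = lsi.fit_transform(past_issues)
query_topics = lsi.transform(["login form crashes"])

# Cosine similarity is now computed in the reduced latent space.
scores = cosine_similarity(query_topics, doc_topics)[0]
print(scores)
```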

Clock Slave
  • Thanks for the response. The challenge we are facing is that when we train on the data using TF-IDF, many text features are generated, whereas new, unseen test data is a single record and yields far fewer features. Because of this size mismatch, we are unable to calculate the cosine similarity between the new, unseen issue and the training data. However, we are able to calculate the cosine similarity if the new data is part of the training data. So I just wanted to know whether it is possible to get similar issues for unseen data. – Sandeep Agarwal Aug 14 '18 at 08:18
  • The size mismatch can be handled. Transform the new record using the same vectorizer that was fitted on the training data (by the way, are you using Python?). That will bring it into the required form. The resulting vector will probably be sparser, but you can still calculate the cosine similarity. So, yes, you can use it on unseen data. Just make sure you follow the same preprocessing steps. – Clock Slave Aug 14 '18 at 10:49
  • Yes, I am using Python and have used the same vectorizer (TfidfVectorizer) and preprocessing for the test data, but I got an error while trying to transform the test data using the fit object created for the training data. – Sandeep Agarwal Aug 14 '18 at 12:14
  • See if this helps: https://stackoverflow.com/questions/42068474/tfidfvectorizer-how-does-the-vectorizer-with-fixed-vocab-deal-with-new-words – Clock Slave Aug 15 '18 at 06:44
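
The thread does not show the actual error, so the following is only a guess at one common cause of a size mismatch: fitting a second vectorizer on the new record instead of calling `transform` on the one already fitted. The sample texts are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical training issues and a single new record.
train_texts = [
    "login page crashes on submit",
    "password reset email never arrives",
    "report export is slow for large projects",
]
new_text = ["login crashes on submit button"]

vectorizer = TfidfVectorizer()
train_matrix = vectorizer.fit_transform(train_texts)

# Likely mistake: a fresh fit on the new record builds its own tiny
# vocabulary, so the column count no longer matches the training matrix.
refit_matrix = TfidfVectorizer().fit_transform(new_text)

# Fix: reuse the fitted vectorizer and only call transform().
new_matrix = vectorizer.transform(new_text)

print("train:", train_matrix.shape)
print("refit:", refit_matrix.shape)
print("transform:", new_matrix.shape)

# The column counts agree, so cosine similarity is well defined.
similarities = cosine_similarity(new_matrix, train_matrix)
print(similarities)
```

If the fitted vectorizer lives in a different process than the one handling new issues, it needs to be persisted and reloaded rather than recreated, for the same reason.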