
I want to print any sentence in a .txt file that is similar to another sentence in the same file.

Example:
If the .txt file contains

1 . What is the biggest planet of our Solar system?
2 . How to make tea?
3 . Which our Solar system's biggest planet?

In this case the output should be:
3 . Which our Solar system's biggest planet?

Basically, it should check whether two lines of the file have more than 4 or 5 words in common.

  • So you want to output the two sentences which are most similar? Is there one sentence per line? And what metric of similarity are you using? Levenshtein? Some sort of sentence embedding? – Nathan Mar 08 '19 at 02:45
  • The [difflib](https://docs.python.org/3.7/library/difflib.html) module might help, especially the function `difflib.get_close_matches` – John Coleman Mar 08 '19 at 02:47
  • @Nathan yes it's only one sentence per line. I'm actually reading a txt file which contains many questions. –  Mar 08 '19 at 02:48
  • @JohnColeman let me check it out –  Mar 08 '19 at 02:51
  • @JohnColeman in that example the words to match against are predefined, which is not the case in mine –  Mar 08 '19 at 02:54
  • Re your edit to the question: do you mean that the distance metric between sentences is the number of words in either sentence which don't occur in the other sentence? – Nathan Mar 08 '19 at 02:56
  • You will have to do something like find close matches for each word, and then use that to find the closest overall match. "closest match" itself is pretty vague. Depending on how you flesh it out, `difflib` might not help all that much. It was just a guess on my part. – John Coleman Mar 08 '19 at 02:56
  • @Nathan the point is that the file shouldn't contain questions with similar meaning. FYI, I'm very new to Python –  Mar 08 '19 at 03:01
  • This part isn't really a Python question, just a design question about what makes sentences similar in your application. "Similar meaning" is hard to encode, but the answer given will probably work well. See [this SO question](https://stackoverflow.com/questions/6690739/fuzzy-string-comparison-in-python-confused-with-which-library-to-use) for a comparison between `difflib` and the Levenshtein metric I mentioned. – Nathan Mar 08 '19 at 03:06
  • Possible duplicate of [Find similar sentences in between two documents and calculate similarity score for each section in whole documents](https://stackoverflow.com/questions/40247413/find-similar-sentences-in-between-two-documents-and-calculate-similarity-score-f) – chickity china chinese chicken Mar 08 '19 at 03:08
  • [Check this drive, this is my project](https://drive.google.com/folderview?id=1mFcfJjRL0MeNxTIuPy_6HRnC8Hp_EDt8) –  Mar 08 '19 at 03:24
  • @Nathan I would say similar sentence, chuck the similar meaning. –  Mar 08 '19 at 03:26
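
The `difflib.get_close_matches` suggestion from the comments does not need a predefined word list: the candidates can simply be the earlier lines of the same file. A minimal sketch, assuming the lines are already loaded into a list; the helper name `find_similar_lines` and the cutoff of 0.5 are illustrative choices, not values taken from the thread:

```python
from difflib import get_close_matches

def find_similar_lines(lines, cutoff=0.5):
    """Return (line, closest_earlier_line) pairs for lines that resemble an earlier line."""
    pairs = []
    for i, line in enumerate(lines):
        # The candidate list is every earlier line of the same file,
        # so nothing has to be predefined
        matches = get_close_matches(line, lines[:i], n=1, cutoff=cutoff)
        if matches:
            pairs.append((line, matches[0]))
    return pairs

questions = [
    "1 . What is the biggest planet of our Solar system?",
    "2 . How to make tea?",
    "3 . Which our Solar system's biggest planet?",
]
for line, earlier in find_similar_lines(questions):
    print(line, "<-", earlier)
```

The comparison is character based and sensitive to word order, so the cutoff usually needs tuning on real data.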

1 Answer


I agree with John Coleman's suggestion: `difflib` can give you a similarity metric between two strings. Here's one possible approach, which prints the later sentence of the most similar pair:

from difflib import SequenceMatcher

# Collect the numbered question lines, skipping headers and blank lines
sentences = []
with open('./bp.txt', 'r') as f:
    for line in f:
        # only consider lines that have a number at the beginning, e.g. "1 . What ...";
        # strip() tolerates a space between the number and the dot
        if line.split('.')[0].strip().isdigit():
            sentences.append(line.rstrip('\n'))

# Compare every pair once and remember the later sentence of the best match
max_ratio = 0
similar_sentence = None
length = len(sentences)
for i in range(length):
    for j in range(i + 1, length):
        match_ratio = SequenceMatcher(None, sentences[i], sentences[j]).ratio()
        if match_ratio > max_ratio:
            max_ratio = match_ratio
            similar_sentence = sentences[j]
if similar_sentence is not None:
    print(similar_sentence)
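
The question's own criterion (four or five words in common) can also be sketched directly, as an alternative to the character-level ratio above. This is only a rough illustration: the helper names `shared_words` and `similar_by_words`, the `min_shared` threshold, and the simple letters-only tokenization are assumptions, not part of the original approach.

```python
import re

def shared_words(a, b):
    """Return the distinct lowercase words that occur in both sentences."""
    words_a = set(re.findall(r"[a-z]+", a.lower()))
    words_b = set(re.findall(r"[a-z]+", b.lower()))
    return words_a & words_b

def similar_by_words(sentences, min_shared=4):
    """Return every sentence that shares at least min_shared words with an earlier one."""
    result = []
    for j, later in enumerate(sentences):
        if any(len(shared_words(earlier, later)) >= min_shared
               for earlier in sentences[:j]):
            result.append(later)
    return result

questions = [
    "1 . What is the biggest planet of our Solar system?",
    "2 . How to make tea?",
    "3 . Which our Solar system's biggest planet?",
]
for sentence in similar_by_words(questions):
    print(sentence)
```

Unlike the best-pair search, this reports every later sentence that crosses the threshold. Common words such as "the" or "is" count toward the total, so a stop-word filter may be worth adding for real data.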

VietHTran
  • You might be able to make this faster using a ball tree or kd tree (e.g. from sklearn). I believe at least ball tree accepts a custom distance metric. Otherwise, you need to do n^2 comparisons for a file with n sentences. No idea how big the files are in this application, but that gets bad quickly. – Nathan Mar 08 '19 at 03:10
  • It's not quite working. [Check this drive, this is my project](https://drive.google.com/folderview?id=1mFcfJjRL0MeNxTIuPy_6HRnC8Hp_EDt8) –  Mar 08 '19 at 03:22
  • @Nathan Thank you for the suggestion. I'm not really familiar with ball tree so I didn't use in my answer. Given OP's example and link to the project, there are not a lot of sentences to compare so I think the current approach should be sufficient. – VietHTran Mar 08 '19 at 03:49
  • @Appries I didn't know the text file has empty lines and headers. I've added the if condition `if line.split('.')[0].isdigit():` to filter only the sentences with numbers at the beginning. – VietHTran Mar 08 '19 at 03:56
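
Short of a tree index, the pairwise loop discussed in these comments can be made somewhat cheaper with `difflib` alone. `SequenceMatcher` caches its analysis of the second sequence, and its `real_quick_ratio()` and `quick_ratio()` methods are cheap upper bounds on `ratio()`, so pairs that cannot beat the current best can be skipped. A sketch under those assumptions; the function name `most_similar_pair` is hypothetical, and the number of pairs visited is still quadratic:

```python
from difflib import SequenceMatcher

def most_similar_pair(sentences):
    """Return (ratio, i, j) for the most similar pair, or None if no pair has a ratio above 0."""
    best = None
    matcher = SequenceMatcher(None)
    for j in range(1, len(sentences)):
        # The analysis of the second sequence is cached, so set it once per outer step
        matcher.set_seq2(sentences[j])
        for i in range(j):
            matcher.set_seq1(sentences[i])
            threshold = best[0] if best else 0.0
            # Both quick ratios are upper bounds on ratio(); skip hopeless pairs early
            if matcher.real_quick_ratio() <= threshold or matcher.quick_ratio() <= threshold:
                continue
            ratio = matcher.ratio()
            if ratio > threshold:
                best = (ratio, i, j)
    return best

questions = [
    "1 . What is the biggest planet of our Solar system?",
    "2 . How to make tea?",
    "3 . Which our Solar system's biggest planet?",
]
best = most_similar_pair(questions)
if best is not None:
    ratio, i, j = best
    print(questions[j], round(ratio, 3))
```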