
Is there a simple way in Python to search a string for matches ("coincidences") with another one? Or does anyone know how it could be done?

To make myself clear, here is an example.

    text_sample = "baguette is a french word"
    words_to_match = ("baguete", "wrd")

    letters_to_match = ('b', 'a', 'g', 'u', 't', 'e', 'w', 'r', 'd')  # with just one 'e'
    coincidences = sum(text_sample.count(x) for x in letters_to_match)

    # coincidences = 14  (current output)
    # coincidences = 10  (expected output)

My current method breaks words_to_match into single characters, as in letters_to_match, but then every occurrence of those characters in text_sample is counted: "baguette is a french word" (coincidences = 14).

What I want is coincidences = 10, where only the letters of "baguette is a french word" that line up with words_to_match are counted, by checking the similarity between words_to_match and the words in text_sample.

How do I get my expected output?

martineau
Pomodor0
  • So you only want the count to include the first occurrence of each character? But in your output "e" is the only character that's counted twice. I don't get the logic here. – Shubham Periwal Jun 20 '21 at 10:19
  • No, if text_sample was "a baguette is a french word", that first 'a' would be matched as the first occurrence, and that's not what I want. I want it done by checking the similarity between words_to_match and the words in text_sample. – Pomodor0 Jun 20 '21 at 10:27
  • That sounds very vague to me as well. Is it something in the direction of [edit distance](https://en.wikipedia.org/wiki/Edit_distance) that you are after? – Dr. V Jun 20 '21 at 10:41
  • Exactly like edit distance. Is there a way to do it in Python? – Pomodor0 Jun 20 '21 at 10:53
  • I'm sure you can find a Python implementation of a function that calculates the Levenshtein distance or one of the other measurement techniques somewhere (or implement one of them yourself). – martineau Jun 20 '21 at 11:28
  • @Pomodor0 You might also want to take a look at [difflib](https://docs.python.org/3/library/difflib.html) – MegaIng Jun 20 '21 at 11:56

2 Answers


First, split words_to_match into individual letters:

    # Join the words into one string, then split that string into a tuple of letters
    words = ''.join(words_to_match)
    letters = tuple(words)

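Then check whether those letters appear, in order, in text_sample:

    x = 0            # index of the next letter to match
    coincidence = 0
    for i in text_sample:
        # the bounds check avoids an IndexError once every letter has been matched
        if x < len(letters) and letters[x] == i:
            x += 1
            coincidence += 1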

If the letters do not have to be in sequence, just do:

    coincidence = 0
    for i in text_sample:
        if i in letters: coincidence += 1

(Note that in some versions of Python you may need a newline after the colon.)
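
As a rough sketch, the in-sequence scan can be wrapped in a small function and run on the example from the question. The name count_in_sequence is made up for illustration. Note that this greedy scan stops at the first letter it cannot find, so it is not a general longest-common-subsequence computation; it works here because every letter of "baguetewrd" appears in order in the text.

```python
def count_in_sequence(text, words):
    """Count how many letters of the joined words are matched, left to right, in text."""
    letters = "".join(words)
    matched = 0
    for ch in text:
        if matched == len(letters):
            break  # every letter has already been matched
        if ch == letters[matched]:
            matched += 1
    return matched


text_sample = "baguette is a french word"
words_to_match = ("baguete", "wrd")
print(count_in_sequence(text_sample, words_to_match))  # 10
```

The leading 'a' in "a baguette is a french word" is skipped as well, because the scan is still waiting for 'b' at that point.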

jp_

It looks like you need the length of the longest common subsequence (LCS). See the algorithm in the Wikipedia article for computing it. You may also be able to find a C extension which computes it quickly. For example, this search has many results, including pylcs. After installation (pip install pylcs):

    import pylcs

    text_sample = "baguette is a french word"
    words_to_match = ("baguete","wrd")
    # lcs returns the longest common subsequence length (lcs2 is the longest common substring)
    print(pylcs.lcs(text_sample, ''.join(words_to_match)))  #: 10
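
If installing an extension is not an option, the LCS length can also be computed in pure Python with the standard dynamic-programming recurrence. The following is a minimal sketch, not the pylcs implementation; the function name lcs_length is chosen here for illustration, and it runs in O(len(a) * len(b)) time.

```python
def lcs_length(a, b):
    """Return the length of the longest common subsequence of a and b."""
    # prev[j] holds the LCS length of the part of a processed so far and b[:j]
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


text_sample = "baguette is a french word"
words_to_match = ("baguete", "wrd")
print(lcs_length(text_sample, "".join(words_to_match)))  # 10
```

Joining the words with a space instead of an empty string would also count the space after "baguette", giving 11 rather than 10.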

pts