
My question is about recognizing the different ways a string can be written in Python. Let me explain what I am trying to do:

A string such as tom and jerry could also be written, in lowercase, as:

  1. tom n jerry
  2. tom_jerry
  3. tom & jerry
  4. tom and jerry

and so on. As you can see in this minimal example, there are 4 possible spellings, and even if I created a dictionary with all 4, I would still miss a string containing tom _ jerry. What can I do to recognize tom and jerry? Creating many rules seems very inefficient. Is there a more efficient way to do this?

Slartibartfast
  • 2
    Recognizing all possibilities will require artificial intelligence. Maybe a NLP library can do this. – Barmar Aug 26 '22 at 20:07
  • 3
    It really depends on your acceptance criteria. For this specific example you could do `s.startswith("tom") and s.endswith("jerry")` to test a given string `s` and it'd return true for all of the examples. But it would also return true for really huge strings that you might not want to accept, and it would return false on minor misspellings of either `tom` or `jerry`, which you also might not want. – Samwise Aug 26 '22 at 20:08
  • 3
    A better approach might be to compute the Levenshtein distance (which is relatively straightforward) and decide on a particular threshold that is "close enough" for your purposes. – Samwise Aug 26 '22 at 20:09
  • 1
    There are fundamentally two ways to do this: (1) Use NLP. This might be a bit over-the-top depending on your use case. (2) Create a set of rules, such as `s.startswith("tom") and s.endswith("jerry") and len(s) < 15` – Lecdi Aug 26 '22 at 20:13
  • @Lecdi I don't think I can do that, because the string is part of a sentence. – Slartibartfast Aug 26 '22 at 20:15
  • Does this answer your question? [Find the similarity metric between two strings](https://stackoverflow.com/questions/17388213/find-the-similarity-metric-between-two-strings) – Bohdan Aug 26 '22 at 20:30
  • Find a way of extracting the string to be checked, and perform some tests on it. You will need to decide what tests to use. I would probably do: `s = s.lower().strip(" \t\n_-"); s.startswith("tom") and s.endswith("jerry") and s[3:-5].strip(" \t\n_-'") in ("", "and", "&", "n")` – Lecdi Aug 26 '22 at 20:32
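The edit-distance idea raised in the comments could be sketched roughly as follows. This is a plain dynamic-programming Levenshtein distance, not code from any commenter; the helper names `levenshtein` and `is_close` and the threshold of 5 are assumptions chosen for illustration. The threshold is just large enough to accept `tom_jerry`, which also makes it loose enough to accept unrelated strings such as `bob and jerry`, so it would need tuning for real data.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    # previous[j] is the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


def is_close(candidate: str, target: str = "tom and jerry", max_distance: int = 5) -> bool:
    # max_distance is a hypothetical threshold, not a recommended value.
    return levenshtein(candidate.lower().strip(), target) <= max_distance
```

This only compares one candidate string against the target, so the candidate still has to be extracted from the sentence first.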

2 Answers

2

This will find any of those combinations in a sentence:

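combo = "tom n jerry"
string = "This is an episode of " + combo + " that deals with something."
start = string.find("tom")
end = string.find("jerry", start)  # look for "jerry" only after "tom"
# str.find returns -1 when the text is missing, so check before slicing
if start != -1 and end != -1:
    print(string[start:end + len("jerry")])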
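Slicing from "tom" to "jerry" accepts anything in between, so a stricter variant might restrict what is allowed to sit between the two names. A minimal sketch with the standard `re` module follows, assuming the only connectors are the ones listed in the question (`and`, `n`, `&`) plus any mix of spaces and underscores; `PATTERN` and `find_pair` are names made up for this example.

```python
import re

# Connectors come from the question's examples; extend the alternation as needed.
PATTERN = re.compile(r"\btom[\s_]*(?:and|n|&)?[\s_]*jerry\b", re.IGNORECASE)


def find_pair(sentence: str):
    """Return the matched variant as written in the sentence, or None."""
    match = PATTERN.search(sentence)
    return match.group(0) if match else None


for text in ("tom n jerry", "tom_jerry", "tom & jerry", "tom and jerry", "tom _ jerry"):
    print(repr(text), "->", find_pair("This is an episode of " + text + " that deals with something."))
```

Unlike a similarity score, this will not tolerate misspellings of either name, which may or may not be what you want.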
stefan_aus_hannover
2

You could attempt this using a sequence matcher.

from difflib import SequenceMatcher

def checkMatch(firstWord: str, secondWord: str, strictness: float) -> bool:
    # ratio() returns a similarity score from 0.0 (nothing in common) to 1.0 (identical)
    ratio = SequenceMatcher(None, firstWord.strip(), secondWord.strip()).ratio()
    return ratio > strictness

if __name__ == "__main__":
    originalWord = "tom and jerry"
    toMatch = "tom_jerry" # chose this one as it is the least likely in your example
    toMatch = toMatch.lower() # str.lower() returns a new string, so the result must be reassigned
    strictness = 0.6 # a strictness of 0.6 would mean the words are generally pretty similar
    print(checkMatch(originalWord, toMatch, strictness))

You can learn more about how sequence matcher works here: https://towardsdatascience.com/sequencematcher-in-python-6b1e6f3915fc
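Since the phrase is usually embedded in a longer sentence, one way to apply the same idea is to score every short run of words against the target and keep the best one. The sketch below is only an illustration of that approach: `best_window`, its `max_words` limit of 3, and the 0.6 cut-off used in the example are assumptions, not part of `difflib`.

```python
from difflib import SequenceMatcher


def best_window(sentence: str, target: str = "tom and jerry", max_words: int = 3):
    """Return (score, span) for the run of words most similar to target."""
    # Split on whitespace only, so "tom_jerry" stays a single token.
    words = sentence.lower().split()
    best = (0.0, "")
    for size in range(1, max_words + 1):
        for start in range(len(words) - size + 1):
            span = " ".join(words[start:start + size])
            score = SequenceMatcher(None, span, target).ratio()
            if score > best[0]:
                best = (score, span)
    return best


score, span = best_window("This is an episode of tom_jerry that deals with something")
if score > 0.6:  # hypothetical cut-off; tune it on your own data
    print(span, round(score, 3))
```

This compares every window of up to `max_words` words, so it is fine for single sentences but would be slow on large texts.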