
I have a list of thousands of strings, sub_strings, and a list of millions of strings, strings.

I want to check whether strings[i] contains a substring from sub_strings, or a substring that differs from one by a single character.

sub_strings = ['hello']
strings =     ['hell dude whats good',
               'hllo',
               'hallo',
               'hello',
               'dude whats good']

is_substring_no_more_than_1_difference(strings, sub_strings)

Expected output:

[True, True, True, True, False]
itay k

  • And what did you try yourself? – 0stone0 Oct 10 '22 at 16:05
  • So you want to split each `strings` element into words, then get the Levenshtein distance of each of those words for each substring in the `sub_strings` list. I would expect the number of Levenshtein calls to run into the billions (1000 substrings * millions of `strings` * the average number of words in each string) to process this whole program. That feels a little heavy-handed. – JNevill Oct 10 '22 at 16:05

1 Answer


Here's one route to a solution:

from Levenshtein import distance as lev

sub_strings = ['hello', 'bob']
strings =     ['hell dude whats good',
               'hllo',
               'hallo',
               'hello',
               'dude whats good',
               'bobo wants some food']

# For each sentence, record the smallest edit distance between any of
# its words and any entry in sub_strings.
distance_list = []
for sentence in strings:
    distance_list.append(min(lev(word, substring)
                             for word in sentence.split()
                             for substring in sub_strings))

# A sentence passes if its best match is at most one edit away.
print([x <= 1 for x in distance_list])

That will spit out [True, True, True, True, False, True]

But this is going to get very slow as either list grows. Every individual word inside each string in strings has to be checked against every entry in sub_strings. That's a lot of checking when strings has millions of entries, each containing several words, and sub_strings has thousands.
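One pruning idea (not part of the original answer, but in line with the comments below): edit distance is always at least the difference in lengths, so any word/sub-string pair whose lengths differ by more than 1 can be skipped without calling lev at all, and you can stop at the first hit instead of computing a full minimum. A minimal sketch, reusing the lists above:

from Levenshtein import distance as lev

def within_one(sentence, subs):
    for word in sentence.split():
        for sub in subs:
            # Edit distance >= |length difference|, so this pair can
            # never be within 1 edit; skip the lev() call entirely.
            if abs(len(word) - len(sub)) > 1:
                continue
            if lev(word, sub) <= 1:
                return True  # early exit on the first match
    return False

print([within_one(sentence, sub_strings) for sentence in strings])

The worst case is unchanged; this only trims the constant factor, and the early exit helps most when matches are common.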

JNevill
  • Thanks for the answer, is there a way to do it more efficiently? – itay k Oct 10 '22 at 16:34
  • Not to my knowledge. At the root of this is the need to check every word against every other word to get the edit-distance/Levenshtein distance, and therefore you have to run a loop inside a loop inside a loop. Perhaps there is something you can do to pare the list down before checking edit distance, but if edit-distance <= 1 is the only condition you can use to determine true/false, then you are stuck running billions of checks. – JNevill Oct 10 '22 at 16:36
  • I'm no expert here, but there may be a path to run this multi-threaded so you aren't running billions of calculations in serial on a single thread (a sketch of that idea follows these comments). [There may be some ideas here](https://stackoverflow.com/questions/5236364/how-to-parallelize-list-comprehension-calculations-in-python). PySpark also comes to mind. First, though, I would run this with ever-increasing list sizes and see where the breaking point is before getting into hairy subjects like multithreading and parallelization. Maybe billions of checks run in a few hours, and maybe that's a reasonable amount of time for your needs. – JNevill Oct 10 '22 at 16:41
  • Thanks, that helped me a lot. If I want to know more about string algorithms, is there a good source you can suggest? I'm just a new student who wants to learn more. – itay k Oct 10 '22 at 16:43
  • I don't know any great resources off the top of my head, but a search for common string-matching algorithms like Levenshtein, Damerau-Levenshtein, Jaro, Jaro-Winkler, Hamming, n-gram, Jaccard, Soundex, Metaphone, etc. will probably bring up lots of info on Google. None of them are complicated, and they could be coded from scratch pretty simply, but knowing about them and how they're used (Soundex, for instance, is an algorithm to check whether two English words sound the same when spoken) is exceptionally useful. – JNevill Oct 10 '22 at 17:27
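Picking up the parallelization idea from the comments: a minimal multiprocessing sketch (not from the answer; it uses processes rather than threads, since Python's GIL limits CPU-bound threading, and within_one is the hypothetical helper from the sketch above):

from functools import partial
from multiprocessing import Pool

from Levenshtein import distance as lev

sub_strings = ['hello', 'bob']
strings = ['hell dude whats good', 'hllo', 'hallo',
           'hello', 'dude whats good', 'bobo wants some food']

def within_one(sentence, subs):
    # True if any word in the sentence is within 1 edit of a sub-string.
    return any(abs(len(word) - len(sub)) <= 1 and lev(word, sub) <= 1
               for word in sentence.split()
               for sub in subs)

if __name__ == '__main__':
    # Each sentence is checked independently, so the work spreads
    # cleanly across CPU cores.
    with Pool() as pool:
        print(pool.map(partial(within_one, subs=sub_strings), strings))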