3

I want to check if a string contains all of the substring's words and retains their order; at the moment I am using the following code; However it is very basic, seems inefficient and likely there is a much better way of doing it. I'd really appreciate if you could tell me what a more efficient solution would be. Sorry for a noob question, I am new to the programming and wasn't able to find a good solution

def check(main, sub_split):
    n=0
    while n < len(sub_split):
        result = True
        if sub_split[n] in main:
            the_start =  main.find(sub_split[n])
            main = main[the_start:]

        else:
            result=False
        n += 1
    return result

a = "I believe that the biggest castle in the world is Prague Castle "
b= "the biggest castle".split(' ')

print check(a, b)

update: interesting; First of all thank you all for your answers. Also thank you for pointing out some of the spots that my code missed. I have been trying different solutions posted here and in the links, I will add update how they compare and accept the answer then.

update: Again thank you all for great solutions, every one of them had major improvements compared to my code; I checked the suggestions with my requirements for 100000 checks and got the following results; suggestions by: Padraic Cunningham - consistently under 0.4 secs (though gives some false positives when searching for only full words; galaxyan - 0.65 secs; 0.75 secs friendly dog - 0.70 secs John1024 - 1.3 secs (Highly accurate, but seems to take extra time)

temo
  • 363
  • 1
  • 4
  • 11
  • 1
    All you need is [`all`](https://docs.python.org/2/library/functions.html#all) :))) `print all(x in a for x in b)` – Nir Alfasi Aug 31 '16 at 23:18
  • If you split `main` into a list as well, you can use apply [this](http://stackoverflow.com/a/24017747) answer. –  Aug 31 '16 at 23:22
  • 2
    @alfasin all seems to return True even if the order of words is different; example a = "I castle believe that biggest the in the world is Prague Castle " b= "the biggest castle".split(' ') is returned as True, it should be false though – temo Aug 31 '16 at 23:30
  • alfasin's code should be `it = iter(a); print all(x in it for x in b)` to account for order. See the [Q/A](http://stackoverflow.com/a/24017747) I linked earlier. –  Aug 31 '16 at 23:34
  • for this particular example, you can drop the `.split` and just do `b in a` – Copperfield Aug 31 '16 at 23:37
  • 1
    @Copperfield that would only check if `b` is a _substring_ of `a`, as opposed to a _subsequence_. The words in `b` do not have to appear contiguously in `a`. –  Aug 31 '16 at 23:40
  • @temo, the problem you describe is called "sublist" or "subsequence". Apart from the "trick" of using regex, the underlying type (string here) is unimportant. – Elazar Aug 31 '16 at 23:50
  • If you want to retain order as well, then Dog's version (here in the comments) should do the trick! – Nir Alfasi Sep 01 '16 at 16:26

3 Answers3

2

You can simplify your search by passing the index of the previous match + 1 to find, you don't need to slice anything:

def check(main, sub_split):
    ind = -1
    for word in sub_split:
        ind = main.find(word, ind+1)
        if ind == -1:
            return False
    return True

a = "I believe that the biggest castle in the world is Prague Castle "
b= "the biggest castle".split(' ')

print check(a, b)

If ind is ever -1 then you get no match after so you return False, if you get thorough all the words then all words are in the string in order.

For exact words you could do something similar with lists:

def check(main, sub_split):
    lst, ind = main.split(), -1
    for word in sub_split:
        try:
           ind = lst.index(word, ind + 1)
        except ValueError:
            return False
    return True

And to handle punctuation, you could first strip it off:

from string import punctuation

def check(main, sub_split):
    ind = -1
    lst = [w.strip(punctuation) for w in main.split()]
    for word in (w.strip(punctuation) for w sub_split):
        try:
           ind = lst.index(word, ind + 1)
        except ValueError:
            return False
    return True

Of course some words are valid with punctuation but that is more a job for nltk or you may actually want to find matches including any punctuation.

Padraic Cunningham
  • 176,452
  • 29
  • 245
  • 321
  • @John1024, true but since the OP was using *in* I presumed substrings were ok to match, if not then a regex with word boundaries or an iterator based approach may be needed. There is also the case of punctuation and case but again the OP does not seem to have that in their own code which may be an oversight. – Padraic Cunningham Sep 01 '16 at 00:01
  • 1
    You are right: the OP was using `in`. In that case, +1. – John1024 Sep 01 '16 at 00:04
  • 1
    @John1024, well between your regex and this they should get what they need. – Padraic Cunningham Sep 01 '16 at 00:05
1

Let's define your a string and reformat your b string into a regex:

>>> a = "I believe that the biggest castle in the world is Prague Castle "
>>> b = r'\b' + r'\b.*\b'.join(re.escape(word) for word in "the biggest castle".split(' ')) + r'\b'

This tests to see if the words in b appear in the same order in a:

>>> import re
>>> bool(re.search(b, a))
True

Caveat: If speed is important, a non-regex approach may be faster.

How it works

The key thing here is the reformulation of the string into a regex:

>>> b = r'\b' + r'\b.*\b'.join(re.escape(word) for word in "the biggest castle".split(' ')) + r'\b'
>>> print(b)
\bthe\b.*\bbiggest\b.*\bcastle\b

\b matches only at word boundaries. This means, for example, that the word the will never be confused with the word there. Further, this regex requires that all the words be present in the target string in the same order.

If a contains a match to the regex b, then re.search(b, a) returns a match object. Otherwise, it returns None. Thus, bool(re.search(b, a)) returns True only if a match was found.

Example with punctuation

Because word boundaries treat punctuation as not word characters, this approach is not confused by punctuation:

>>> a = 'From here, I go there.'
>>> b = 'here there'
>>> b = r'\b' + r'\b.*\b'.join(re.escape(word) for word in b.split(' ')) + r'\b'
>>> bool(re.search(b, a))
True
John1024
  • 109,961
  • 14
  • 137
  • 171
  • This is a very interesting trick. Yet it is not useful since it is less efficient and less readable than the alternatives. – Elazar Aug 31 '16 at 23:53
  • I find the regex perfectly clear and readable. On the other hand, if you post a superior answer, I will be happy to upvote it. – John1024 Aug 31 '16 at 23:59
  • Given the answers [here](http://stackoverflow.com/questions/17996414/data-structure-for-subsequence-queries) it might be the case that regex _is_ the fast way. – Elazar Sep 01 '16 at 00:02
-1

if you just want to check whether there is a word contain in other string, no need to check all. You just need to find one and return true.
when you check the item set is faster O(1)(average)

a = "I believe that the biggest castle in the world is Prague Castle "
b = "the biggest castle"

def check(a,b):
    setA,lstB = set( a.split() ), b.split() 
    if len(setA) < len(lstB): return False 
    for item in lstB:
        if item in setA:
            return True
    return False

print check(a,b)

if you donot care the speed

def check(a,b):
    setA,lstB = set( a.split() ), b.split() 
    return len(setA) >= len(lstB) and any( 1 for item in lstB if item in setA) 

for speed and time complexity: link

galaxyan
  • 5,944
  • 2
  • 19
  • 43