2

I need some advice from Stack Overflow again. I'm not sure the title properly conveys what I'm asking.

The thing is this.

There are two groups of words, and I need to know whether a string contains one (or more) words from group A while also containing a word from group B. Like this:

Group_A = ['nice','car','by','shop']
Group_B = ['no','thing','great']

t_string_A = 'there is a car over there'
t_string_B = 'no one is in a car'

t_string_A has 'car' from Group_A but nothing from Group_B, so it should return, let's say, 0. t_string_B has 'car' from Group_A and 'no' from Group_B, so it should return 1.

So far I have been doing this in a rather primitive way, with a bunch of blocks of code like:

if 'nice' in t_string_A and 'no' in t_string_A:
    return 1

But as Group A or Group B grows, I have to write far too many of these blocks, which certainly does not scale.

I appreciate your help and attention :D Thanks in advance!

Jeong In Kim
  • If I understand correctly, the words at the same indexes in `GroupA` and `GroupB` are being checked against `t_string_A` and `t_string_B` respectively? – Devesh Kumar Singh Apr 30 '19 at 05:28
  • See this discussion: https://stackoverflow.com/questions/11015320/how-to-create-a-trie-in-python You can make a trie for A and one for B and test for membership. – dgumo Apr 30 '19 at 05:45
  • @DeveshKumarSingh Oh, in the real code it is in a loop, so yeah, it should be checked respectively :D – Jeong In Kim Apr 30 '19 at 05:47
  • Okay, check if my answer helps you then @JeongInKim and upvote/accept it if helps you :) – Devesh Kumar Singh Apr 30 '19 at 06:05

5 Answers

5

You could work with sets:

Group_A = set(('nice','car','by','shop'))
Group_B = set(('no','thing','great'))

t_string_A = 'there is a car over there'
t_string_B = 'no one is in a car'

set_A = set(t_string_A.split())
set_B = set(t_string_B.split())

def test(string):
    s = set(string.split())
    if Group_A & s and Group_B & s:
        return 1
    else:
        return 0

What should the result be if the string has no words from either Group_A or Group_B?

Depending on your phrases, the test may be more efficient this way, since any() stops at the first match:

def test(string):
    s = string.split()
    if any(word in Group_A for word in s) and any(word in Group_B for word in s):
        return 1
    else:
        return 0
hiro protagonist
  • Oh it should return 0 as well :D But I think your answer has solved my problem already haha. I'll apply it in my code soon! – Jeong In Kim Apr 30 '19 at 05:35
  • What is the `&` operator which is applied between a list and a set? Never seen this! – Devesh Kumar Singh Apr 30 '19 at 05:39
  • @DeveshKumarSingh `&` is the same as `self.intersection(other)`. the link to the doc explains it. – hiro protagonist Apr 30 '19 at 05:46
  • Good solution, but I think we only need one word from each group to get the same result, so there is no need for this costly operation. One word is enough, which is faster. – sahasrara62 Apr 30 '19 at 05:46
  • Damn! Thanks for the tidbit @hiroprotagonist, I learned something new today :) – Devesh Kumar Singh Apr 30 '19 at 05:48
  • @hiroprotagonist It is efficient, but I am saying that for the same output we need at most one word from each group. Building the sets is a costly operation, that's all. – sahasrara62 Apr 30 '19 at 05:49
  • 1
    @prashantrana added a variant. this should address your concerns, right? – hiro protagonist Apr 30 '19 at 05:54
  • Ahhh, so close! T_T Actually I'm working in Korean, and it looks like Korean grammar has an issue with split(): there is a word 'Noun 입니다', but split() only identifies a standalone '입니다', so any 입니다 attached to a noun is not classified as 입니다. – Jeong In Kim Apr 30 '19 at 06:00
  • In English term, hmm... I need to check 'thing' including 'something' 'anything' etc. ... hmm.... – Jeong In Kim Apr 30 '19 at 06:01
  • @hiroprotagonist that is it, i have also posted a solution, what i wanted to say – sahasrara62 Apr 30 '19 at 06:02
1

You can use itertools.product to generate all possible pairs of words from the given groups. You then iterate through the list of strings, and if both words of any pair are present in a string, the result for that string is True; otherwise it is False.

import itertools as it

Group_A = ['저는', '저희는', '우리는']
Group_B = ['입니다','라고 합니다']

strings = [ '저는 학생입니다.', '저희는 회사원들 입니다.' , '이 것이 현실 입니다.', '우리는 배고파요.' , '우리는 밴디스트라고 합니다.']

#Get all possible combinations of words from the group
z = list(it.product(Group_A, Group_B))

results = []

#Run through the list of strings
for s in strings:
    flag = False
    for item in z:
        #If both words of the pair are present in the string, flag is True
        if item[0] in s and item[1] in s:
            flag = True
            break
    #Append result to the results list
    results.append(flag)

print(results)

The result will then look like

[True, True, False, False, True]

In addition, for the inputs below

Group_A = ['thing']
Group_B = ['car']
strings = ['there is a thing in a car', 'Nothing is in a car','Something happens to my car']

The values will be [True, True, True]
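
The same substring check can also be sketched more compactly without building every pair. This is only an illustration: `contains_from_both` is a hypothetical name, and the lower-casing assumes the case-insensitive matching requested in the comments (it has no effect on Korean text).

```python
def contains_from_both(text, group_a, group_b):
    # Lower-case everything so the substring test ignores case.
    lowered = text.lower()
    in_a = any(word.lower() in lowered for word in group_a)
    in_b = any(word.lower() in lowered for word in group_b)
    return in_a and in_b

strings = ['there is a thing in a car', 'Nothing is in a car',
           'Something happens to my car']
print([contains_from_both(s, ['Thing'], ['car']) for s in strings])
```

Because each `any()` stops at its first hit, this does at most len(group_a) + len(group_b) substring tests per string instead of one test per pair.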

Community
Devesh Kumar Singh
  • Thanks Devesh. But I found an issue with split() since I'm actually using Korean. It should be like... Group_A = ['thing'] Group_B = ['car'] string_A = 'there is a thing in a car' string_B = 'Nothing is in a car' string_C = 'Something happens to my car' If I use split(), as you know, 'Nothing' or 'Something' in string_B and string_C is not classified as 'thing' from Group_A. Is there any other way to check for the word in a whole sentence? I want string_A, B and C to all return True :D – Jeong In Kim Apr 30 '19 at 06:11
  • But should there be a `Group_C` for `string_C` ? Or do we have a mapping of groups versus strings ? How do I know which of `Group_A` and `Group_B` will map to `String_C` @JeongInKim ? – Devesh Kumar Singh Apr 30 '19 at 06:16
  • Oh. I should have cleared this. Group_A and Group_B is just for references and strings are in a loop. So I want to check every strings in a loop respectively, if they has a word from Group_A and other words in a Group_B simultaneously. – Jeong In Kim Apr 30 '19 at 06:26
  • `if they has a word from Group_A and other words in a Group_B simultaneously`, what does that mean? All strings in the loop should have words from the group? Or are the strings picked in pairs, then compared against groups and only if each pair satisfied the condition for Group_A being in StringX and Group_B being in StringY for a (StringX, StringY) pair? – Devesh Kumar Singh Apr 30 '19 at 06:30
  • One String is picked by a loop, so nothing to do with pairs (I showed a two strings to show some examples of True and False). And for each picked string, I should check if they have a word from Group_A and the other word from Group_B at the same time. I'm sorry for the confusions. – Jeong In Kim Apr 30 '19 at 06:36
  • Okay let me add to my answer! Does the accepted answer handle this scenario @JeongInKim – Devesh Kumar Singh Apr 30 '19 at 06:37
  • Thanks Devesh. Actually, no. But I wasn't clear about the Korean part, and it does handle English, so I chose it as the answer. But I am willing to change my acceptance or post a new question if you can help me here – Jeong In Kim Apr 30 '19 at 06:41
  • Can you share some of your Korean text here? Let me take a crack at it, also split splits on whitespace, does your Korean text not have whitespace in between words? – Devesh Kumar Singh Apr 30 '19 at 06:43
  • Also correct me if I am wrong but for `Group_A = ['thing']` and `Group_B = ['car']` and `strings = [ 'there is a thing in a car' , 'Nothing is in a car' , 'Something happens to my car']`, the answer will be `True,False,False` ? Since only `there is a thing in a car` has `thing` and `car` in the sentence? Also why do you need two groups, when both groups check in the same string? Why not have one group called `['thing', 'car']` ? – Devesh Kumar Singh Apr 30 '19 at 06:44
  • Sure. I really appreciate your passion to help me! strings should return True True True, cuz it has thing in Nothing and Something as well. That's why I'm having problem with split() Let's try this in Korean `Ref_A = ['저는', '저희는', '우리는'] Ref_B = ['입니다','라고 합니다',] ex_A = '저는 학생입니다.' ex_B = '저희는 회사원들 입니다.' ex_C = '이 것이 현실 입니다.' ex_D = '우리는 배고파요.' ex_E = '우리는 밴디스트라고 합니다.'` The return should be 'T, T, F, F, T' ! – Jeong In Kim Apr 30 '19 at 06:52
  • Aah now I understand, you want to check in part of words too, and Is it case sensitive or case insensitive? Not sure about case in Korean though – Devesh Kumar Singh Apr 30 '19 at 07:18
  • Also will the `groups` be of unequal size like the example here, where `Ref_A` is of size 3 and Ref_B is of size 2? – Devesh Kumar Singh Apr 30 '19 at 07:25
  • Yes ! And Case insensitive please since there isn't any capital words in Korean. Also it will be more helpful for me if I try it in English :D And Also yes, those groups are much likely to have different sizes. (So maybe having some problem with zips? haha... Idk). So in examples, the string should be checked with 3 * 2 sets of combination. – Jeong In Kim Apr 30 '19 at 07:37
  • Got it, and if any combination in contained in the string, the result should be true? – Devesh Kumar Singh Apr 30 '19 at 07:39
  • Precisely Correct :D !! – Jeong In Kim Apr 30 '19 at 07:42
  • Okay I updated my answer! I used your example, and the answer I get is what you expected as well @JeongInKim – Devesh Kumar Singh Apr 30 '19 at 07:49
  • Ahhh, you are such a lifesaver, Devesh! It works! I really appreciate it! Now I can add more words to the reference lists without worrying about adding a bunch of duplicate code haha. I can also learn from your code and apply it in other contexts as well! Sincerely, thanks! – Jeong In Kim Apr 30 '19 at 07:58
  • I would suggest to be more verbose and precise the next time you ask a question :) – Devesh Kumar Singh Apr 30 '19 at 08:19
  • Sure thing! I will! I'm so sorry for holding you up so much. – Jeong In Kim Apr 30 '19 at 08:48
  • @Cireo Thanks Cireo! That would be my concerns as well, but each group only contains less than... 20? So it's good for my code! But I'm studying other codes as well thanks to others! – Jeong In Kim May 08 '19 at 12:04
1
Group_A = ['nice','car','by','shop']
Group_B = ['no','thing','great']

from collections import defaultdict

group_a=defaultdict(int)
group_b=defaultdict(int)

for i in Group_A:
    group_a[i]=1

for i in Group_B:
    group_b[i]=1

t_string_A = 'there is a car over there'
t_string_B = 'no one is in a car'

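def fun2(string):
    # Manual whitespace split: collect the non-empty runs between spaces
    l=[]
    past=0
    for i in range(len(string)):
        if string[i]==' ':
            if string[past:i]!='':
                l.append(string[past:i])
            past=i+1
    # Append the trailing word, which the loop above never reaches
    if string[past:]!='':
        l.append(string[past:])
    return l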

def fun(string,dic):
    # Return 1 as soon as any word of the string is a key in dic
    for i in fun2(string):
        # .get() avoids inserting missing words into the defaultdict
        if dic.get(i):
            return 1
    return 0

for s in (t_string_A, t_string_B):
    # 1 only if the same string has a word from both groups
    if fun(s,group_a) and fun(s,group_b):
        print(1)
    else:
        print(0)
sahasrara62
  • Thanks for your answer Rana! But it turns out, I need something without split()... T_T;; If you got some answers checking a word 'thing' including 'anything' and 'something' etc. That will be great help ! – Jeong In Kim Apr 30 '19 at 06:07
  • @JeongInKim solution updated according to what you need: it doesn't use `split()`, but I wrote a function that does the same thing. – sahasrara62 Apr 30 '19 at 06:28
0

This can be solved efficiently with a variation on the Aho-Corasick algorithm.

It is an efficient dictionary matching algorithm that locates patterns within text simultaneously in O(p + q + r), with p = length of patterns, q = length of text, r = length of returned matches.

You may want to run two separate state machines simultaneously, and you would need to modify them so they terminate on the first match.

I took a stab at the modifications, starting with this Python implementation:

class AhoNode(object):
    def __init__(self):
        self.goto = {}
        self.is_match = False
        self.fail = None

def aho_create_forest(patterns):
    root = AhoNode()
    for path in patterns:
        node = root
        for symbol in path:
            node = node.goto.setdefault(symbol, AhoNode())
        node.is_match = True
    return root

def aho_create_statemachine(patterns):
    root = aho_create_forest(patterns)
    queue = []
    for node in root.goto.itervalues():
        queue.append(node)
        node.fail = root
    while queue:
        rnode = queue.pop(0)
        for key, unode in rnode.goto.iteritems():
            queue.append(unode)
            fnode = rnode.fail
            while fnode is not None and key not in fnode.goto:
                fnode = fnode.fail
            unode.fail = fnode.goto[key] if fnode else root
            unode.is_match = unode.is_match or unode.fail.is_match
    return root

def aho_any_match(s, root):
    node = root
    for i, c in enumerate(s):
        while node is not None and c not in node.goto:
            node = node.fail
        if node is None:
            node = root
            continue
        node = node.goto[c]
        if node.is_match:
            return True
    return False

def all_any_matcher(*pattern_lists):
    ''' Returns an efficient matcher function that takes a string
    and returns True if at least one pattern from each pattern list
    is found in it.
    '''
    machines = [aho_create_statemachine(patterns) for patterns in pattern_lists]

    def matcher(text):
        return all(aho_any_match(text, m) for m in machines)
    return matcher

and to use it

patterns_a = ['nice','car','by','shop']
patterns_b = ['no','thing','great']

matcher = all_any_matcher(patterns_a, patterns_b)

text_1 = 'there is a car over there'
text_2 = 'no one is in a car'
for text in (text_1, text_2):
    print '%r - %s' % (text, matcher(text))

This displays

'there is a car over there' - False
'no one is in a car' - True
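
For small word groups, a simpler stand-in can be sketched with the standard `re` module. To be clear, this is not Aho-Corasick but a swapped-in technique: one compiled alternation per group. The name `make_matcher` is hypothetical, and the sketch assumes every pattern list is non-empty.

```python
import re

def make_matcher(*pattern_lists):
    # One compiled alternation per group; re.escape keeps each word literal.
    # Assumes non-empty lists: an empty alternation would match any text.
    regexes = [re.compile('|'.join(re.escape(p) for p in patterns))
               for patterns in pattern_lists]

    def matcher(text):
        # search() returns at the first match, so each group stops early.
        return all(rx.search(text) is not None for rx in regexes)
    return matcher

matcher = make_matcher(['nice', 'car', 'by', 'shop'], ['no', 'thing', 'great'])
for text in ('there is a car over there', 'no one is in a car'):
    print('%r - %s' % (text, matcher(text)))
```

Like the state machines above, this matches substrings rather than whole words, so 'thing' is found inside 'something'.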
Cireo
  • Ah sounds fancy! But I think that's way over my capability xD Thanks for your attention though ! It's really interesting running multiple state machines simultaneously. That would help my other problems a lot. – Jeong In Kim Apr 30 '19 at 05:37
  • 1
    @JeongInKim don't give up hope. If you do a little web-searching you will find that very similar solutions to this problem exist that you may be able to start with. In fact, even running the unmodified version (that finds all matches, run twice) has the potential to be several times faster than what you will come up with otherwise. Stopping on the first match could just be an interesting project to undertake – Cireo Apr 30 '19 at 05:47
  • this is a [link only answer](https://meta.stackexchange.com/a/8259/377931) - please [edit] your answer so that is self-sufficient even if the link does not work anymore. – Patrick Artner Apr 30 '19 at 06:02
  • @JeongInKim added an implementation. It is python2.7, but only because of the `print` missing parenthesis and the implicit `object` being declared. You can also return the matching portion quite easily - see the original implementation for portions that I removed – Cireo Apr 30 '19 at 21:05
0

You can iterate over the words and see if any of them is in the string:

from typing import List

def has_word(string: str, words: List[str]) -> bool:
    for word in words:
        if word in string:
            return True
    return False

This function can easily be adapted into a has_all_words variant too.
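
A minimal sketch of that adaptation, plus one way to combine it for the two-group check from the question. `has_word_from_each` is a hypothetical helper name, and all three functions match substrings, not whole words.

```python
from typing import List

def has_word(string: str, words: List[str]) -> bool:
    # True if at least one word occurs as a substring.
    return any(word in string for word in words)

def has_all_words(string: str, words: List[str]) -> bool:
    # True only if every word occurs as a substring.
    return all(word in string for word in words)

def has_word_from_each(string: str, *groups: List[str]) -> bool:
    # True if each group contributes at least one matching word.
    return all(has_word(string, group) for group in groups)

group_a = ['nice', 'car', 'by', 'shop']
group_b = ['no', 'thing', 'great']
print(has_word_from_each('there is a car over there', group_a, group_b))
print(has_word_from_each('no one is in a car', group_a, group_b))
```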

truth