
I am coding in Python 3 on a Windows platform.

I am writing a function that takes a user's input sentence, calls .split() on it, and produces a list of the words in the original sentence.

The function also takes a predefined list of word patterns, and it should report each pattern whose exact sequence of words appears in the user's sentence.

To be clear, I can already use .intersection() to find matching individual words, but I am looking for an exact sequence of words.

For instance, if my user inputs "I love hairy cats" and the predefined list of keywords is ["I love", "hairy cats", "I love cats", "love hair"], my function should report only "I love" and "hairy cats", since only these two appear in the sentence as the same sequence of whole words.

Here is my code thus far:

def parse_text(message, keywords):
    newList = []
    Message = message.split()
    Keywords = keywords      # Keywords need to be a list type
    setMessage = set(word for word in Message)
    setKeywords = set(word for word in Keywords)
    newList = setMessage.intersection(setKeywords)

    return newList

This works perfectly as long as my keywords list contains only single words. The problem arises when an entry contains multiple words to denote a sequence.

If my user's original message is:

message = "Hello world, yes and no"

keywords = ["help", "Hello", "yes", "so"]  # this works: the intersection is "Hello" and "yes"

keywords = ["help me", "Hello mom", "yes and no", "so"]  # this does not work: it just returns an empty "set()"

Any ideas on how I can adjust my function to check my user's original sentence for a specific sequence of words as they appear in my keyword list?

vaylain
  • This is not for an assignment; it is for a small program I am trying to make. The user's sentence will actually come from a JSON dict returned by an API call to a website, but I didn't want to complicate my question with irrelevant details. – vaylain Jul 08 '16 at 13:27
  • seems like it would be MUCH easier to stringify the list and check for a string within a string. – CaffeineAddiction Jul 08 '16 at 13:28
  • @JulienBernu You can't, because it would return also "love hair", because it matched "love hairy", but OP wants to match whole words. – arekolek Jul 08 '16 at 13:28
  • What about punctuation then? Should `message="Hello, mom"` and `keyword="Hello"` give a match? Because `message.split()` will return `"Hello,"` and not `"Hello"` ... – Julien Jul 08 '16 at 13:33
  • Yes, I have encountered this as well, but I know I can fix that part later. For now, I am really concerned with matching the exact sequence of words as they appear in the keywords list. If punctuation disqualifies a match, so be it. – vaylain Jul 08 '16 at 13:36
  • My program will be parsing Slack posts for an exact sequence of words. – vaylain Jul 08 '16 at 13:38

3 Answers

2

Why use sets at all? This is a pretty straightforward string operation:

def parse_text(message, keywords):
    newList = []
    for keyword in keywords:
        # Plain substring test: "love hair" also matches inside "love hairy"
        if keyword in message:
            newList.append(keyword)
    return newList

Or, more succinctly, using a list comprehension:

def parse_text(message, keywords):
    return [keyword for keyword in keywords if keyword in message]

Finally, one additional form using regular expressions that enforces complete words:

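from re import escape, search

def parse_text(message, keywords):
    newList = []
    for keyword in keywords:
        # escape() keeps any regex metacharacters in the keyword literal;
        # \b anchors the match to word boundaries on both sides
        if search(r'\b{}\b'.format(escape(keyword)), message):
            newList.append(keyword)
    return newList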
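
As a minimal sketch of how the word-boundary form behaves on the example from the question, the variant below is written as a list comprehension. The name `find_phrases` is hypothetical and not from the original post. It assumes every keyword starts and ends with a word character, since `\b` only marks a boundary between a word character and a non-word character.

```python
import re

def find_phrases(message, keywords):
    """Hypothetical helper: return the keywords that occur in message as whole words."""
    return [
        kw for kw in keywords
        # re.escape treats the keyword as literal text; \b requires word boundaries
        if re.search(r'\b{}\b'.format(re.escape(kw)), message)
    ]

message = "I love hairy cats"
keywords = ["I love", "hairy cats", "I love cats", "love hair"]
print(find_phrases(message, keywords))  # ['I love', 'hairy cats']
```

Here "love hair" is rejected because the match would end in the middle of "hairy", where there is no word boundary.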
Feneric
  • Basically I am only a novice so far with Python and wasn't aware it was this easy. I really appreciate your help as your recommendation was exactly the answer I required. I have spent the last two days getting my code to somewhat work. You knocked this out in a few minutes. My hat off to you sir! – vaylain Jul 08 '16 at 13:47
  • You're welcome. Don't knock yourself. It's easy to get thinking of a problem in a direction that makes it a bit harder to solve, and it's often easier for someone outside to see the alternative approach. – Feneric Jul 08 '16 at 13:49
  • I do not get this. `parse_text("I love hairy cats", ["I love", "hairy cats", "I love cats", "love hair"])` returns `"love hair"`, while the author specified it should not. – Delgan Jul 08 '16 at 13:51
  • You need `r'\b{}\b'`, so you don't match "hairy cats" when you look for "airy cats". Also, `re` solution works as a list comprehension just as well. – arekolek Jul 08 '16 at 14:12
  • Thanks, I added the missing flag. Yes, it's just a style question of whether or not it's more or less readable as the list comprehension. I agree it's a perfectly valid approach. – Feneric Jul 08 '16 at 14:16
1

This can easily be done by transforming your keywords list into a list of lists, and then checking which of those lists are sublists of your message's words.

def is_sublist(sub_lst, lst):
    n = len(sub_lst)
    return any((sub_lst == lst[i:i + n]) for i in range(len(lst) - n + 1))

message = "Hello world yes and no"
words = message.split()

keywords = ["help me", "Hello mom", "yes and no", "so"]
keywords_lists = [k.split() for k in keywords]
# [['help', 'me'], ['Hello', 'mom'], ['yes', 'and', 'no'], ['so']]

new_sub_lists = [k for k in keywords_lists if is_sublist(k, words)]
new_list = [" ".join(k) for k in new_sub_lists]
# ['yes and no']

The is_sublist function (inspired by @Nas's answer) is far from optimal.

If you are looking for a solution with lower complexity, you should take a look at other string searching algorithms, because your problem can be seen as a string search in which your words play the role of letters.
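
As one illustration of that idea, here is a hedged sketch of the Knuth-Morris-Pratt algorithm applied to lists of words instead of strings of characters. KMP is only one of several string searching algorithms and is chosen here for illustration; the answer above does not name a specific one. The function names `build_failure` and `contains_sequence` are hypothetical.

```python
def build_failure(pattern):
    """failure[i] is the length of the longest proper prefix of
    pattern[:i + 1] that is also a suffix of it."""
    failure = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = failure[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        failure[i] = k
    return failure

def contains_sequence(words, pattern):
    """Return True if pattern occurs as a contiguous run inside words."""
    if not pattern:
        return True  # an empty pattern trivially matches
    failure = build_failure(pattern)
    k = 0  # number of pattern words matched so far
    for word in words:
        while k and word != pattern[k]:
            k = failure[k - 1]  # fall back without re-reading earlier words
        if word == pattern[k]:
            k += 1
        if k == len(pattern):
            return True
    return False

words = "Hello world yes and no".split()
keywords = ["help me", "Hello mom", "yes and no", "so"]
matches = [k for k in keywords if contains_sequence(words, k.split())]
print(matches)  # ['yes and no']
```

Each keyword is checked in time proportional to the number of words in the message plus the number of words in the keyword, rather than their product as with the slicing version of is_sublist.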

Delgan
  • This was also helpful as I studied your technique. Seems like I can glean some usefulness from your approach as well. Thank you. – vaylain Jul 08 '16 at 13:59
1

You could do something like:

    def parse_text(message, keywords):
        return [kw for kw in keywords if kw in message]
jpm