I'm trying to parse sentences in Python. For any sentence I get, I should take only the words that appear after the word 'say' or 'ask' (if neither word appears, I should take the whole sentence). I simply did it with a regular expression:

sen = re.search('(?s)(?<=say|Say).*$', current_game_row["sentence"], re.M | re.I)

(this is only for 'say', but adding 'ask' is not a problem...)

The problem is that if I get a sentence with punctuation such as a comma or colon (,:) right after the word 'say', the match includes it too. Someone suggested that I use nltk tokenization to handle this, but I'm new to Python and don't understand how to use it. I see that nltk has RegexpParser, but I'm not sure how to use it. Please help me :-)

** I forgot to mention: I want to recognize 'said', 'asked', etc. too, and I don't want to catch words that merely contain 'say' or 'ask' (I'm not sure there are such words...). In addition, if there are multiple occurrences of 'say' or 'ask', I only want to catch the first one in the sentence. **

Razzle Shazl
merav
  • You could simply `re.split(r'\b(?:say|ask)\b[,.;:!?]*', sentence)` and check if the result is more than one element. What should happen if there are multiple "say" or "ask" tokens? What about inflections like "said" and "asked"? – tripleee Feb 05 '21 at 11:57
  • You're right, I forgot to mention this: I want to recognize 'said' etc. and don't want to catch words that merely contain 'say' (I'm not sure there are such words...). In addition, if there are multiple 'say' or 'ask' tokens, I only want to catch the first one – merav Feb 05 '21 at 12:55
  • So `re.split(r'\b(?:sa(?:ys?|id)|ask(?:ed|s)?)\b[,.;:?!]*', sentence, maxsplit=1, flags=re.I)[-1]`. The `\b` word boundaries prevent the regex from matching in the middle of a word. You don't need to enumerate lower vs upper case with `re.I` (though then of course it will also match on "aSkS" etc). – tripleee Feb 05 '21 at 13:04
  • "Essay" is an example of a word which contains "say" as a substring. "Ask" is much easier to find examples of ("task", "basked", "flasks", etc). – tripleee Feb 05 '21 at 13:08
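
The split-based idea from these comments can be sketched roughly as follows. This is only an illustration, not a tested solution from the thread: the helper name `after_keyword` is made up, and `\s` was added to the punctuation class (an assumption beyond the comment's pattern) so that the space after the keyword is dropped as well.

```python
import re

# Inflections of "say"/"ask" with \b on both sides, so "essay" and "task"
# are skipped; trailing punctuation and whitespace are consumed too.
KEYWORD = re.compile(r'\b(?:sa(?:ys?|id)|ask(?:ed|s)?)\b[\s,.;:?!]*', re.I)

def after_keyword(sentence):
    # maxsplit=1 splits at the first keyword only; [-1] is the whole
    # sentence when no keyword is present (split returns one element).
    return KEYWORD.split(sentence, maxsplit=1)[-1]

print(after_keyword("The essay task: I asked, where is it? They say here"))
# → where is it? They say here
print(after_keyword("No keyword in this one"))
# → No keyword in this one
```

Only the first keyword splits the sentence, so the later "say" stays in the result.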

1 Answer

Everything after a Keyword

We can deal with the unwanted punctuation by using \W+ to eat up all the non-word characters (punctuation and whitespace) that directly follow the keyword.

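import re

sentence = "Hearsay? With masked flasks I said: abracadabra"

# \b on both sides keeps "Hearsay", "masked" and "flasks" from matching
keys = '|'.join(['ask', 'asks', 'asked', 'say', 'says', 'said'])
# \W+ consumes the punctuation and spaces right after the keyword
result = re.search(rf'\b({keys})\b\W+(.*)', sentence, re.S | re.I)

if result is None:
    print(sentence)  # no keyword found: keep the whole sentence
else:
    print(result.group(2))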

Output:

abracadabra 
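
To check the "first keyword only" and "whole sentence otherwise" requirements, here is a small sketch that wraps the same pattern in a function. The name `text_after_keyword` and the sample sentences are made up for illustration. Note one edge case: because `\W+` needs at least one non-word character, a keyword that is the very last word produces no match, and the whole sentence comes back.

```python
import re

KEYS = ['ask', 'asks', 'asked', 'say', 'says', 'said']
PATTERN = re.compile(rf"\b(?:{'|'.join(KEYS)})\b\W+(.*)", re.S | re.I)

def text_after_keyword(sentence):
    """Return the text after the first keyword, or the whole sentence."""
    match = PATTERN.search(sentence)  # search() reports the leftmost match
    return sentence if match is None else match.group(1)

samples = [
    "She said, hello. Then he asked: why?",  # two keywords: leftmost wins
    "Nothing to extract in this task",       # "task" is not a keyword
    "That is what I said",                   # keyword is last: no match
]
for s in samples:
    print(repr(text_after_keyword(s)))
# → 'hello. Then he asked: why?'
# → 'Nothing to extract in this task'
# → 'That is what I said'
```

If an empty string is preferred when the keyword ends the sentence, changing `\W+` to `\W*` should do it.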

case-insensitive: You already pass the case-insensitive flag re.I, so the separate Say alternative can be dropped.

multi-line: You have the re.M option, which makes ^ match not only at the start of your string but also right after every \n within it (and $ likewise right before every \n). We can drop it, since the new pattern uses neither ^ nor $.

dot-matches-all: You have (?s), which makes . match everything including \n. This is the same as applying the re.S flag.

re.M and re.S are independent of each other: re.M only changes ^ and $, while re.S only changes the dot, so having both is harmless. I think your sentence might be a text blob with newlines inside, so I removed re.M and kept (?s) as re.S.
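
For what it's worth, the two flags can be compared on a two-line string. This is just a small demonstration with a made-up sample text, not part of the solution itself:

```python
import re

text = "first line\nsecond line"

# re.M: ^ also matches right after each newline.
starts_plain = re.findall(r'^\w+', text)
starts_multi = re.findall(r'^\w+', text, re.M)
print(starts_plain)   # → ['first']
print(starts_multi)   # → ['first', 'second']

# re.S: "." also matches the newline, so ".*" runs to the end of the string.
dot_plain = re.search(r'first.*', text).group()
dot_all = re.search(r'first.*', text, re.S).group()
dot_inline = re.search(r'(?s)first.*', text).group()  # inline form of re.S
print(repr(dot_plain))   # → 'first line'
print(repr(dot_all))     # → 'first line\nsecond line'
print(dot_all == dot_inline)   # → True
```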

Razzle Shazl