I am trying to create a regex that will match a full sentence that includes a keyword. This is an example passage:

"Cash taxes paid, net of refunds, were $412 million 2016. The U.S. Tax Act imposed a mandatory one-time tax on accumulated earnings of foreign subsidiaries and changed how foreign earnings are subject to U.S. tax."

I want to match full sentences that include the keyword "subsidiaries". To accomplish this, I have been using the following regular expression:

[^.]*?subsidiaries[^.]*\.

However, this will only match " Tax Act imposed a mandatory one-time tax on accumulated earnings of foreign subsidiaries and changed how foreign earnings are subject to U" since the expression begins and ends at the "." in "U.S.". Is there a way to specify in the expression that I do not want it to stop at specific phrases, such as "U.S." or ".com"?

A. Ryan

1 Answer

I suggest tokenizing the text into sentences with NLTK and then checking whether the keyword is present in each sentence.

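# sent_tokenize needs the Punkt tokenizer data; download it once with nltk.download('punkt')
# (recent NLTK releases may ask for 'punkt_tab' instead).
import re
import nltk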
text = "Cash taxes paid, net of refunds, were $412 million 2016. The U.S. Tax Act imposed a mandatory one-time tax on accumulated earnings of foreign subsidiaries and changed how foreign earnings are subject to U.S. tax."
sentences = nltk.sent_tokenize(text)
word = "subsidiaries"
print([sent for sent in sentences if word in sent])
# => ['The U.S. Tax Act imposed a mandatory one-time tax on accumulated earnings of foreign subsidiaries and changed how foreign earnings are subject to U.S. tax.']

To extract only declarative sentences (those ending with a period), add an and sent.endswith('.') condition:

print([sent for sent in sentences if word in sent and sent.endswith('.')])

You can also require a whole-word match by searching with a regular expression that wraps the keyword in word boundaries:

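# re.escape keeps any regex metacharacters in the keyword from being interpreted as a pattern
print([sent for sent in sentences if re.search(r'\b{}\b'.format(re.escape(word)), sent)])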
# => ['The U.S. Tax Act imposed a mandatory one-time tax on accumulated earnings of foreign subsidiaries and changed how foreign earnings are subject to U.S. tax.']
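
If you would rather stay with a plain regular expression, as the question asks, one option is to list the phrases whose periods must not end a sentence and let the pattern consume each of them as a single unit. The following is only a minimal sketch: the function name sentences_with and its exceptions parameter are made up for illustration (they are not part of re or NLTK), and it assumes that every sentence ends with a period and that none of the listed phrases is itself the last thing in a sentence.

```python
import re

def sentences_with(text, keyword, exceptions=("U.S.", ".com")):
    """Return sentences containing keyword, never splitting inside an exception phrase."""
    # Hypothetical helper: each exception phrase is escaped and tried before the
    # generic "any character except a period" alternative.
    protected = "|".join(re.escape(e) for e in exceptions)
    # A sentence is a run of protected phrases or non-period characters,
    # followed by the period that terminates it.
    sentence_re = re.compile(r"(?:{}|[^.])+\.".format(protected))
    # Match the keyword as a whole word only.
    word_re = re.compile(r"\b{}\b".format(re.escape(keyword)))
    return [m.group().strip()
            for m in sentence_re.finditer(text)
            if word_re.search(m.group())]

text = ("Cash taxes paid, net of refunds, were $412 million 2016. "
        "The U.S. Tax Act imposed a mandatory one-time tax on accumulated "
        "earnings of foreign subsidiaries and changed how foreign earnings "
        "are subject to U.S. tax.")
print(sentences_with(text, "subsidiaries"))
# => ['The U.S. Tax Act imposed a mandatory one-time tax on accumulated earnings of foreign subsidiaries and changed how foreign earnings are subject to U.S. tax.']
```

The exception list has to be maintained by hand, which is why a trained sentence tokenizer is usually the more robust choice for arbitrary text.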
Wiktor Stribiżew