I need to split a sentence on specific words or phrases held in an array. Only whole-word matches should count, not partial ones (e.g., if the array contains the word "Note" but the sentence contains "Notes", that occurrence should be ignored).

The words in the array are sorted by length, longest first, so that the longest entry is tried first.

This is the current code:

import re
sentence = "For the Issuer to note that this is a reference to Section 309B(1) of the FAA. All Issuers must adhere to this."
words_arr = ["Section 309B(1)", "FAA", "Issuer"]
sorted_words = sorted(words_arr, key=len, reverse=True)

regex = re.compile(r"\b(" + "|".join(r"\b%s\b" % re.escape(x) for x in sorted_words) + r")", re.IGNORECASE)

split_text = regex.split(sentence)
# current output ["For the ", "Issuer", " to note that this is a reference to Section 309B(1) of the ", "FAA", ". All Issuers must adhere to this."]

# expected output ["For the ", "Issuer", " to note that this is a reference to ", "Section 309B(1)", " of the ", "FAA", ". All Issuers must adhere to this."]

How can I fix this regex to get the expected output?
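
One possible direction, offered as a sketch rather than a definitive fix: `\b` only matches between a word character and a non-word character. "Section 309B(1)" ends in `)`, and the character after it in the sentence is a space, so both sides are non-word characters and the trailing `\b` can never match there. Replacing the boundaries with the lookarounds `(?<!\w)` and `(?!\w)` avoids this, since they only require that no word character sits directly next to the match. The helper name `split_on_whole_words` is made up for illustration and assumes `words` is non-empty.

```python
import re

sentence = ("For the Issuer to note that this is a reference to "
            "Section 309B(1) of the FAA. All Issuers must adhere to this.")
words_arr = ["Section 309B(1)", "FAA", "Issuer"]


def split_on_whole_words(text, words):
    """Split text on whole-word occurrences of any entry in words.

    Hypothetical helper: assumes words is a non-empty list of strings.
    """
    # Longest entries first, so a longer phrase wins over a shorter one
    # that starts at the same position.
    ordered = sorted(words, key=len, reverse=True)
    alternation = "|".join(re.escape(w) for w in ordered)
    # (?<!\w) and (?!\w) check that no word character touches the match.
    # Unlike \b, they still succeed when the entry itself starts or ends
    # with a non-word character such as ")".
    pattern = re.compile(r"(?<!\w)(" + alternation + r")(?!\w)", re.IGNORECASE)
    # The capturing group makes split() keep the matched words in the result.
    return pattern.split(text)


split_text = split_on_whole_words(sentence, words_arr)
print(split_text)
```

With the sample sentence this keeps "Issuers" intact, because the `s` after "Issuer" is a word character. Note that a match at the very start or end of the text leaves an empty string in the result, which is standard `re.split` behaviour.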
