I am trying to extract the text between two words, including the boundary words themselves, using `re.findall()`.

import re

description = 'White cat sat on the mat and then the cat ran away'
starting_word = 'cat'
ending_word = 'ran'

detail_re = r'{0}.*?{1}'.format(starting_word, ending_word)
extracted_text_list = re.findall(detail_re, description, re.IGNORECASE)

Expected result:

['cat sat on the mat and then the cat ran', 'cat ran']

However, the result is:

['cat sat on the mat and then the cat ran']

How can I get the expected answer?

nr spider
  • That is the expected answer. There's only one match for *detail_re* in your *description* string. If you want to capture the beginning part, you need `r'*?{0}.*?{1}'` – Jeff Gruenbaum Aug 26 '22 at 14:43
  • Shorter example (oneliner): `re.findall('cat.*?ran', 'White cat sat on the mat and then the cat ran away')` – Thomas Aug 26 '22 at 14:44
  • How could the first item in your expected results possibly be a match? It doesn't start with "cat". – jasonharper Aug 26 '22 at 14:44
  • Try `detail_re = r'(.*\b({0})\b.*?\b({1})\b)'.format(starting_word, ending_word)` – Wiktor Stribiżew Aug 26 '22 at 14:45
  • @JeffGruenbaum this regular expression throws an error 're.error: nothing to repeat at position 0'. I am expecting ['cat sat on the mat and then the cat ran', 'cat ran'] – nr spider Aug 26 '22 at 14:52
  • @nrspider Forgot to add the period. Should be `r'.*?{0}.*?{1}'`. Didn't realize you edited your expected result. Ignore the regex suggestion. It will still return only one match because the word *ran* appears only once, so the pattern can match only once. If you change your description to `description = 'White cat sat on the mat and then the ran cat ran away'`, you can see how it will now match twice. – Jeff Gruenbaum Aug 26 '22 at 14:54
  • @WiktorStribiżew thank you for your response. But it returns [('White cat sat on the mat and then the cat ran', 'cat', 'ran')] – nr spider Aug 26 '22 at 14:54
  • @JeffGruenbaum I have edited the expected answer (removed 'white' from the results). My concern is about extracting overlapping results, so there can be two results – nr spider Aug 26 '22 at 14:59
  • Good, so the best you can do is `detail_re = r'\b(({0})\b.*?\b({1}))\b'.format(starting_word, ending_word)`. You cannot put disjoint texts into one group. – Wiktor Stribiżew Aug 26 '22 at 14:59
  • This works: `import re` `description = 'White cat sat on the mat and then the cat ran away'` `detail_re = r'(?=(cat.*?ran))'` `matches = re.finditer(detail_re, description)` `extracted_text_list = [match.group(1) for match in matches]` `print(extracted_text_list)` – Jeff Gruenbaum Aug 26 '22 at 15:05
  • Use a lookahead assertion: `detail_re = r'(?=({0}.*?{1}))'.format(starting_word, ending_word)`. The rest of the code is fine. – mrin9san Aug 26 '22 at 15:09
  • The correct answer has been provided by @mrin9san. There are two ways to do this. Method 1, using `re`: `detail_re = r'(?=({0}.*?{1}))'.format(starting_word, ending_word)` `extracted_text_list = re.findall(detail_re, description, re.IGNORECASE)`. Method 2, using `regex`: `detail_re = regex.compile(r'{0}.*?{1}'.format(starting_word, ending_word))` `extracted_text_list = detail_re.findall(description, overlapped=True)`. Note that in Method 2 `re.IGNORECASE` cannot be passed to the compiled pattern's `findall()`; a flag would have to go to `regex.compile()` instead. – nr spider Aug 26 '22 at 15:19
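
For anyone landing here later, the lookahead suggestion from the comments can be sketched as below. This is a minimal sketch rather than a definitive solution: it uses only the standard `re` module, and the `re.escape()` calls and the names `plain_matches` and `overlapping_matches` are my own additions for illustration, not part of the original question. It contrasts the plain pattern, which yields only non-overlapping matches, with the lookahead pattern, which lets the captured text overlap.

```python
import re

description = 'White cat sat on the mat and then the cat ran away'
starting_word = 'cat'
ending_word = 'ran'

# Assumption (not in the original code): escape the words so that any
# regex metacharacters in them are treated literally.
start = re.escape(starting_word)
end = re.escape(ending_word)

# Plain pattern: findall() scans left to right and returns only
# non-overlapping matches, so the second 'cat' is consumed by the first match.
plain_re = r'{0}.*?{1}'.format(start, end)
plain_matches = re.findall(plain_re, description, re.IGNORECASE)

# Lookahead pattern: the match itself is zero-width, so the scan advances
# one character at a time and the capture group may overlap earlier captures.
lookahead_re = r'(?=({0}.*?{1}))'.format(start, end)
overlapping_matches = re.findall(lookahead_re, description, re.IGNORECASE)

print(plain_matches)        # ['cat sat on the mat and then the cat ran']
print(overlapping_matches)  # ['cat sat on the mat and then the cat ran', 'cat ran']
```

Because the pattern contains exactly one capture group, `re.findall()` returns the captured strings rather than the empty zero-width matches, which is why the rest of the original code can stay unchanged.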

0 Answers