How can I extract the text between two known words in a string, with the condition that the text between them is (i) 1 character, (ii) 1 word, (iii) 2 words, etc.?

Sample Text:

text = ("MNOTES - GEO GEO MNOTES 20 231-0005 GEO GEO GEO GEO GEO MNOTES SOME REVISION MNOTES CASUAL C GEO GEO GEO GEO GEO MNOTES F232322500 MNOTES HELP PAGES GEO GEO GEO GEO MNOTES SHEET 1 OF 3 GEO GEO MNOTES CASUAL E. GEO GEO MNOTES SITPOPE/TIN AY GEO GEO MNOTES R GEO GEO GEO GEO MNOTES 22+0436/T.SKI/11-AUG-1986 GEO GEO GEO GEO MNOTES 231-0045 GEO")

I have a string like the one above that has multiple occurrences of the two known words 'MNOTES' and 'GEO'. The text between them can be anything and can contain any number of words.

Sometimes I want to extract only the text that has a single character between those two words, sometimes only the text that has 2 words between them, sometimes only the text that has 6 words, and so on. How can I extract the text subject to such a condition?

user10256551
  • Would be good if you post your desired output from this string. – Pankaj Mar 07 '19 at 03:01
  • Condition 1: extract text that has one character between MNOTES and GEO. Output 1: '-', 'R'. Condition 2: extract text that has two words between MNOTES and GEO. Output 2: '20 231-0005', 'CASUAL C', 'CASUAL E.', 'SITPOPE/TIN AY' – user10256551 Mar 07 '19 at 03:08
  • You want to use regular expressions. Check out this answer, I hope it solves your problem. https://stackoverflow.com/questions/32680030/match-text-between-two-strings-with-regular-expression – Adam Howard Mar 07 '19 at 02:46

1 Answer

Use re.findall.

import re

re.findall('MNOTES(.*?)GEO', text)

This results in:

[' - ', ' 20 231-0005 ', ' SOME REVISION MNOTES CASUAL C ', ' F232322500 MNOTES HELP PAGES ', ' SHEET 1 OF 3 ', ' CASUAL E. ', ' SITPOPE/TIN AY ', ' R ', ' 22+0436/T.SKI/11-AUG-1986 ', ' 231-0045 ']
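Note that the lazy pattern still returns ' SOME REVISION MNOTES CASUAL C ', because the match starts at the first MNOTES and runs on to the next GEO. If only the text after the nearest MNOTES is wanted, one possible variation is a negative lookahead that stops the capture from crossing a second MNOTES. This is a sketch, not part of the original answer; the names `pattern` and `segments` are illustrative.

```python
import re

text = (
    "MNOTES - GEO GEO MNOTES 20 231-0005 GEO GEO GEO GEO GEO "
    "MNOTES SOME REVISION MNOTES CASUAL C GEO GEO GEO GEO GEO "
    "MNOTES F232322500 MNOTES HELP PAGES GEO GEO GEO GEO "
    "MNOTES SHEET 1 OF 3 GEO GEO MNOTES CASUAL E. GEO GEO "
    "MNOTES SITPOPE/TIN AY GEO GEO MNOTES R GEO GEO GEO GEO "
    "MNOTES 22+0436/T.SKI/11-AUG-1986 GEO GEO GEO GEO MNOTES 231-0045 GEO"
)

# (?!MNOTES) is checked before every captured character, so a match
# can never run across a second MNOTES on its way to GEO.
pattern = re.compile(r"MNOTES((?:(?!MNOTES).)*?)GEO")

# Strip the surrounding spaces from each captured piece.
segments = [piece.strip() for piece in pattern.findall(text)]
print(segments)
# ['-', '20 231-0005', 'CASUAL C', 'HELP PAGES', 'SHEET 1 OF 3', 'CASUAL E.',
#  'SITPOPE/TIN AY', 'R', '22+0436/T.SKI/11-AUG-1986', '231-0045']
```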

Edit

To match a specific number of characters, the following will work (note the raw string, so that \s reaches the regex engine unchanged):

re.findall(r'MNOTES\s?(.{1})\s?GEO', text)

Results in

['-', 'R']

And to get only results that are 6-8 characters long:

re.findall(r'MNOTES\s?(.{6,8})\s?GEO', text)

Results:

['- GEO ', 'CASUAL C', 'R GEO ', '231-0045']
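In the 6-8 character result, '- GEO ' and 'R GEO ' appear because .{6,8} is free to consume a GEO before the pattern's final GEO matches. For the word-count condition in the question, one possible approach that avoids regex quantifiers altogether is to split the string into tokens, collect the tokens between each MNOTES and the next GEO, and filter afterwards. The following is a minimal sketch under that assumption; `segments_between`, `one_char` and `two_words` are hypothetical names, not part of the original answer.

```python
text = (
    "MNOTES - GEO GEO MNOTES 20 231-0005 GEO GEO GEO GEO GEO "
    "MNOTES SOME REVISION MNOTES CASUAL C GEO GEO GEO GEO GEO "
    "MNOTES F232322500 MNOTES HELP PAGES GEO GEO GEO GEO "
    "MNOTES SHEET 1 OF 3 GEO GEO MNOTES CASUAL E. GEO GEO "
    "MNOTES SITPOPE/TIN AY GEO GEO MNOTES R GEO GEO GEO GEO "
    "MNOTES 22+0436/T.SKI/11-AUG-1986 GEO GEO GEO GEO MNOTES 231-0045 GEO"
)


def segments_between(text, start="MNOTES", end="GEO"):
    """Return the words found between each `start` and the next `end`."""
    found = []
    current = None  # None means we are not inside a start..end span
    for token in text.split():
        if token == start:
            current = []  # a repeated start marker discards earlier words
        elif token == end:
            if current:  # skip empty spans and stray end markers
                found.append(" ".join(current))
            current = None
        elif current is not None:
            current.append(token)
    return found


segments = segments_between(text)

# Condition 1: exactly one character between the markers.
one_char = [s for s in segments if len(s) == 1]
# Condition 2: exactly two words between the markers.
two_words = [s for s in segments if len(s.split()) == 2]

print(one_char)   # ['-', 'R']
print(two_words)  # ['20 231-0005', 'CASUAL C', 'HELP PAGES', 'CASUAL E.', 'SITPOPE/TIN AY']
```

'HELP PAGES' shows up here although it is missing from the expected output given in the question's comments, since it is also two words sitting between an MNOTES and a GEO. Changing the number in either filter gives the other cases, for example `len(s.split()) == 6` for six words.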
Jab
  • Your output is wrong. As per your answer it would be "[' - ', ' 20 231-0005 ', ' CASUAL C ', ' SHEET 1 OF 3 ', ' CASUAL E. ', ' R ', ' 231-0045 ']" – Pankaj Mar 07 '19 at 03:04
  • I wanted to extract based on the count of words or characters. For example, return output only when the text between the words is one character; in this case I want only '-' and 'R'. How can I put that as a condition in the script? Please suggest. – user10256551 Mar 07 '19 at 03:04
  • @PS1212 no, see: https://repl.it/repls/ProfuseCorruptDirectory what I have in my answer is correct – Jab Mar 07 '19 at 03:17
  • I cannot seem to get the per-word version working, but this gets you a split by character count – Jab Mar 07 '19 at 03:34