
I have the following raw text output, from which I need to extract selected blocks, but my Python regex does not pick out the right ones. My string is:

label 123 start
    int
    some other random text
    exit
exit
label 576 start
    int
    some other random text
    exit
exit
label 888 start
    explanation jgfjgjgj
    some random text 
    exit
up up
exit
label 902 start
    explanation jgfjgjgj
    some random text 
    exit
up up
exit
label 456 start
    explanation jgfjgjgj
    some random text 
    exit
up up
exit

From the above text I would like to capture the following blocks as individual items:

Item 1
label 888 start
    explanation jgfjgjgj 
    some random text 
    exit
up up
exit
Item 2
label 902 start
    explanation jgfjgjgj
    some random text 
    exit
up up
exit
Item 3
label 456 start
    explanation jgfjgjgj
    some random text 
    exit
up up
exit

I have the following regex:

(label)\s\d{1,4}(.*?)(?=\s*explanation)(.*?)\s+up up

It also captures the following two items, which I do not want:

label 123 start
    start
    some other random text
    exit
exit
label 576 start
    start
    some other random text
    exit
exit

I built it on the assumption that the lookahead for the word "explanation" would restrict matches to items that start at label and finish at 'up up'. Instead, the first match includes all of label 123 and label 576. I thought the lookahead would stop that, but it does not.

frank
  • You need to use a negative lookahead to prevent `.*?` from going to the next item to find `explanation`. – Barmar May 19 '17 at 02:13
  • How would I construct a negative lookahead so that the regex does not match the first two items, which do not contain the word explanation? Thanks – frank May 19 '17 at 02:20
  • Not sure. But why did you undo all the formatting fixes I made to your question, and put back those stupid `` tags? – Barmar May 19 '17 at 02:36
  • Maybe something like `(.*?(?!label))(?=\s*explanation)` – Barmar May 19 '17 at 02:44
  • See http://stackoverflow.com/questions/19750096/python-regex-find-a-substring-that-doesnt-contain-a-substring – Barmar May 19 '17 at 02:45
  • The lookahead makes it stop when it gets to `explanation`. But there's nothing making it stop when it gets to another `label`, so `.*` can include multiple label blocks. – Barmar May 19 '17 at 02:49
  • I tried your suggestion and it selects the entire string. I want the regex to start matching at label 888 start, and to match only the blocks that contain the keyword explanation. – frank May 19 '17 at 03:26
  • @frank: So the `up up` is not necessary for the capture? Or you want both `explanation` and `up up`? Your specification is imprecise. – rici May 19 '17 at 03:59
  • @rici, for the items that have the word "explanation" (i.e. not "int"), I would like to capture, as an example: `label 888 start explanation jgfjgjgj some random text exit up up exit` – frank May 19 '17 at 04:28
  • @frank: You know more about what you are trying to achieve than we do, and I still don't know from your description whether `up up` is required or optional. But I did my best to produce a specification and a regex which matches it. – rici May 19 '17 at 04:30
  • @frank Which suggestion did you try? The incorrect regexp I put in my comment, or the answers at the question I linked to? Those answers should work better. – Barmar May 19 '17 at 17:19
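
The point made in the comments above (nothing stops `.*?` from running past the next `label`) can be reproduced with a small sketch. This is not code from the thread: the sample text is shortened, the names `loose` and `tempered` are made up for illustration, and the second pattern is one possible reading of the "negative lookahead" hint. It assumes the word `label` never occurs inside a block body.

```python
import re

text = """label 123 start
    int
    exit
exit
label 888 start
    explanation abc
    exit
up up
exit
label 902 start
    explanation def
    exit
up up
exit
"""

# The question's pattern with DOTALL: nothing stops .*? at the next "label",
# so the first match begins at label 123 and runs into the label 888 block.
loose = re.compile(r"label\s\d{1,4}(.*?)(?=\s*explanation)(.*?)\s+up up", re.S)

# Tempered version (an assumption, not the thread's accepted fix): every
# consumed character is checked so that it does not begin the word "label".
tempered = re.compile(
    r"label\s\d{1,4}(?:(?!label).)*?explanation(?:(?!label).)*?up up\s*exit",
    re.S,
)

loose_items = [m.group(0) for m in loose.finditer(text)]
tempered_items = [m.group(0) for m in tempered.finditer(text)]

print(loose_items[0].splitlines()[0])                      # first line of the unwanted match
print([item.splitlines()[0] for item in tempered_items])   # first line of each wanted block
```

The per-character lookahead is simple to write but does extra work on every character, so it may be slower than a line-oriented pattern on large inputs.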

2 Answers


I'm assuming that what you are looking for is a stanza which:

  • starts with an unindented line beginning with `label` followed by an integer
  • includes an indented line beginning with `explanation`
  • does not include any other unindented lines, except that it is terminated by an unindented `up up` followed by an unindented `exit`.

That corresponds to the regular expression:

(?mx)^label[ \t]+\d{1,4}.*     # Unindented line starting label
     (?:\n[ \t]+.*)*?          # Some indented lines (non-greedy)
     (?:\n[ \t]+explanation.*) # Indented explanation
     (?:\n[ \t]+.*)*           # More indented lines
     \nup\ up\nexit\n          # Termination sequence including final newline

Testing:

import re

text = """label 123 start
    int
    some other random text
    exit
exit
label 576 start
    int
    some other random text
    exit
exit
label 888 start
    explanation jgfjgjgj
    some random text 
    exit
up up
exit
label 902 start
    explanation jgfjgjgj
    some random text 
    exit
up up
exit
label 456 start
    explanation jgfjgjgj
    some random text 
    exit
up up
exit
"""

r = r'''(?mx)
    ^label[ \t]+\d{1,4}.*     # Unindented line starting label
    (?:\n[ \t]+.*)*?          # Some indented lines (non-greedy)
    (?:\n[ \t]+explanation.*) # Indented explanation
    (?:\n[ \t]+.*)*           # More indented lines
    \nup\ up\nexit\n          # Termination sequence including final newline
'''

for i, m in enumerate(re.findall(r, text)):
    print("Item "+str(i)+"\n"+m)

Item 0
label 888 start
    explanation jgfjgjgj
    some random text 
    exit
up up
exit

Item 1
label 902 start
    explanation jgfjgjgj
    some random text 
    exit
up up
exit

Item 2
label 456 start
    explanation jgfjgjgj
    some random text 
    exit
up up
exit
rici
  • This still matches the following string (items), which I do not want: `label 123 start start some other random text exit exit label 576 start start some other random text exit exit` – frank May 19 '17 at 04:37
  • @frank: Not on my machine – rici May 19 '17 at 04:45
  • I tried it at https://regex101.com (https://regex101.com/r/OkefJ7/1) and cannot make it work. Question: what does `(?mx)` do? – frank May 19 '17 at 05:25
  • @frank: your question says python so I did it in python. You really have to be careful with online regex testers: make sure you specify the regex dialect you are using. If the regex tester doesn't handle your regex dialect, find a different one. I'd just test python regexes with python, if I were you. `(?mx)` (in python) sets the multiline and extended format flags; these are `re.M` and `re.X`. See http://docs.python.org (`re` module) for details. – rici May 19 '17 at 05:35
  • @frank, Actually, that regex tester explains the flags. However, you seem to have set the `s` flag, which will cause the pattern to fail. Don't do that; it is not the default. If you turn that flag off, it should work. (But I still think you should just try it in Python.) – rici May 19 '17 at 05:42
  • I will test with Python. I did modify the flags on the online regex site, and now it performs the way it should, except that two entries which should be selected are missed. Have a look at this one: [online version](https://regex101.com/r/OkefJ7/2) – frank May 19 '17 at 06:21
  • @frank: the first one doesn't match because of the space before `up up`, and the last one doesn't match because the last exit is not terminated with a newline. If your input might show those features, you'll need to adjust the regex (eg by putting `[ \t]*` before `up up` -- if you really want to allow it to be optionally indented -- and adding a `?` after the last `\n`.) This is why being able to precisely describe the target input is so important. – rici May 19 '17 at 06:39
  • Thanks. What does `(?:` do in the regex? I see it used throughout. So basically you start by matching the label, and the key is to match only the blocks that contain the word explanation, so there is no need for a lookahead or lookbehind. Correct? – frank May 19 '17 at 08:58
  • rici, the regex you suggested is working in Python. One more question: for a match such as item 0, what happens if there are 50 lines of text between label 888 and up up? What is the most practical way to capture items with a large number of lines? – frank May 19 '17 at 11:56
  • @frank `(?:` tells the regex engine that the parenthesized subexpression does not need to be captured. That can speed things up on some regex engines; I don't know how much use it is in Python, but it can't hurt (it's not posix, though). Since the pattern doesn't use lookarounds, it should be fine on long matches as long as the `explanation` line is close to the beginning of the stanza. Non-greedy matches tend to be slow if they need to extend a lot, but that shouldn't be the case here. – rici May 19 '17 at 14:17
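
The two adjustments suggested in the comments above (`[ \t]*` before `up up`, and a `?` after the final `\n`) might look like the sketch below. It passes `re.M | re.X` as arguments, which the comments describe as equivalent to the inline `(?mx)`. The sample text and the names `pattern` and `items` are invented for illustration, and the input is assumed to resemble the question's format.

```python
import re

# Sample input: the label 2 block has an indented "up up", and the
# final "exit" has no trailing newline.
text = (
    "label 1 start\n    int\n    exit\nexit\n"
    "label 2 start\n    explanation a\n    exit\n  up up\nexit\n"
    "label 3 start\n    explanation b\n    exit\nup up\nexit"
)

pattern = re.compile(r"""
    ^label[ \t]+\d{1,4}.*      # unindented line starting with label
    (?:\n[ \t]+.*)*?           # indented lines before explanation (non-greedy)
    \n[ \t]+explanation.*      # indented explanation line
    (?:\n[ \t]+.*)*            # remaining indented lines
    \n[ \t]*up\ up             # "up up", now optionally indented
    \nexit\n?                  # final exit; the trailing newline is optional
""", re.M | re.X)

# All groups are non-capturing, so findall returns the full match strings.
items = pattern.findall(text)
for i, item in enumerate(items):
    print("Item " + str(i) + "\n" + item)
```

Because every group is written as `(?:...)`, `findall` yields whole stanzas rather than tuples of sub-groups.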

Check the following regex:

(label\s\d{1,4}\sstart(\s*explanation)(.*?)up\sup\s*exit)

It should work. Check here for demo

Amey Dahale
  • Hi Amey, yes, that works fine and also handles the scenario where there are many lines of text between label and up up. In Python I am using the DOTALL flag, but I see that you used `(.*?)` after the explanation, which does not catch the item at the beginning of the string. I will do some more testing. – frank May 19 '17 at 13:49
  • Have you tested by passing the flag as re.S while matching the pattern? – Amey Dahale May 19 '17 at 17:48
  • Yes, I have tested using re.S, which I believe is short for re.DOTALL – frank May 21 '17 at 05:00
  • Amey, the regex you suggested captures what I am after, but findall in Python returns a list of tuples. My understanding is that if the regex has multiple capture groups, it returns a list of tuples. If I only want a list of strings, the following should work: `(label\s\d{1,4}\sstart\s*explanation.*?up\sup\s*exit)`. I have removed all the parentheses except for one pair. – frank May 21 '17 at 07:24
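
The `findall` behaviour discussed in the last comment can be checked with a short sketch. The one-block sample text and the variable names are made up for illustration; the two patterns are the ones quoted in this answer and its comments (the second without its outer parentheses, which does not change the matched text).

```python
import re

text = "label 888 start\n    explanation abc\n    exit\nup up\nexit\n"

# Three capture groups: findall returns one tuple per match.
multi = r"(label\s\d{1,4}\sstart(\s*explanation)(.*?)up\sup\s*exit)"
# No capture groups: findall returns the whole match as a string.
single = r"label\s\d{1,4}\sstart\s*explanation.*?up\sup\s*exit"

tuples = re.findall(multi, text, re.S)
strings = re.findall(single, text, re.S)

# Alternative: keep the groups and take group(0) from finditer.
whole = [m.group(0) for m in re.finditer(multi, text, re.S)]

print(type(tuples[0]).__name__, len(tuples[0]))
print(type(strings[0]).__name__)
print(whole == strings)
```

Using `finditer` with `group(0)` avoids having to rewrite a pattern just to change what `findall` returns.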