So I have the following text:

a
111
b
222
c
333
d

and I want to capture all contents between these alphabetical delimiters. So I tried

import re
test_str=r"""a
111
b
222
c
333
d
"""
res = re.findall(r"[a-z]{1}\n([\d\D]+?)\n[a-z]{1}", test_str)

Note that [\d\D] is for any character including newlines, because in real examples the contents in between may be complicated and contain many lines. Anyway, my expected output is

['111', '222', '333']

but instead, the actual result is

['111', '333']

My guess is that once the first occurrence a\n111\nb is matched, it is consumed and does not take part in the subsequent matching, so b is no longer available as the opening delimiter for 222, which is why that block is skipped.
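
The guess can be checked with a short sketch that prints where each match starts and ends. It uses re.finditer in place of re.findall (a substitution made only to expose the match spans), and the variable names are illustrative.

```python
import re

test_str = "a\n111\nb\n222\nc\n333\nd\n"

# Same pattern as above, without the redundant {1}.
pattern = r"[a-z]\n([\d\D]+?)\n[a-z]"

# Matches are non-overlapping: each search resumes where the previous match ended.
spans = [m.span() for m in re.finditer(pattern, test_str)]
print(spans)  # [(0, 7), (12, 19)]
```

The first match ends at index 7, just past b, so b cannot open a second match and the search only succeeds again at c.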

Is there any simple way to capture contents between such consecutive delimiters? Thanks in advance.

Vim
  • P.S. `{1}` is redundant... – Adam.Er8 Aug 01 '20 at 14:05
  • [regex-lookbehind-and-lookahead](https://stackoverflow.com/questions/47886809/python-regex-lookbehind-and-lookahead) - they do not "consume" the match – Patrick Artner Aug 01 '20 at 14:05
  • @Sushanth thanks for your comment. But the `test_str` is just a toy example. My real problem involves text where the contents in between can literally be anything and not just numbers. – Vim Aug 01 '20 at 14:05

2 Answers

You can use a (positive) lookahead instead:

r"(?s)[a-z]\n(.+?)(?=[a-z])" 

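A lookahead does not consume the text it matches; it only asserts that a match is possible at that position, so the closing delimiter remains available as the opening delimiter of the next match.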

res = re.findall(r"(?s)[a-z]\n(.+?)(?=[a-z])", test_str) # ['111\n', '222\n', '333\n']

See https://regex101.com/r/6FEFkZ/2 or [Python regex lookbehind and lookahead](https://stackoverflow.com/questions/47886809/python-regex-lookbehind-and-lookahead)
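
The result above keeps a trailing newline on each block, while the question asks for ['111', '222', '333']. A possible variation, offered as a sketch and not as part of the original answer, moves the newline into the lookahead and adds the multiline flag so that only a line consisting of a single lowercase letter counts as a delimiter. It assumes every block is non-empty.

```python
import re

test_str = "a\n111\nb\n222\nc\n333\nd\n"

# (?m) lets ^ and $ match at line boundaries; (?s) lets . match newlines.
# The lookahead requires "\n" followed by a whole-line delimiter, so the
# captured block excludes the trailing newline.
pattern = r"(?ms)^[a-z]\n(.+?)(?=\n[a-z]$)"

res = re.findall(pattern, test_str)
print(res)  # ['111', '222', '333']

# Blocks may span several lines and contain letters.
multi = re.findall(pattern, "a\n1\nxy\nb\n2\nc\n")
print(multi)  # ['1\nxy', '2']
```

Anchoring the delimiter to a whole line is a design choice: without `$`, a content line that merely starts with a lowercase letter would end the block early.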

Wiktor Stribiżew
Patrick Artner

This solution does not use regex, but it is simple and easy to understand:

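import string

teststr = """
111
a
222
b
333
"""
# Delimiters are lines consisting of a single lowercase letter. A set is used
# because `in` on a string is a substring test, so a line like "abc" would be dropped.
delimiters = set(string.ascii_lowercase)
# Keep every non-empty line that is not a delimiter.
print([line for line in teststr.split('\n') if line and line not in delimiters])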
  • (and probably faster as well..) – Adam.Er8 Aug 01 '20 at 14:15
  • thanks for your answer which works perfectly for the test example. But my real problem involves text where the contents in between can literally be anything and not just numbers. – Vim Aug 01 '20 at 14:18
  • What is 'anything'? –  Aug 01 '20 at 14:46
  • This will work fast, and it will work in any case where the delimiters are lowercase alphabetical letters. It will work regardless of the number of newlines. It will work regardless of the presence of letters in the content. –  Aug 01 '20 at 15:29
  • @AryanParekh unfortunately the delimiter may not be just alphabetical letters... it can be a regex pattern. – Vim Aug 01 '20 at 15:55
  • In that case, it won't work. You mentioned you want to capture contents between *alphabetical delimiters*, so I answered accordingly. –  Aug 01 '20 at 15:59
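
For the case raised in the comments, where the delimiter is itself a regex pattern and a block may span several lines, one possible sketch uses re.split. The helper name `split_blocks` and its `delim` parameter are hypothetical. It assumes the delimiter occupies a whole line and contains no capturing groups.

```python
import re

def split_blocks(text, delim=r"[a-z]"):
    """Return the blocks between consecutive delimiter lines (hypothetical helper)."""
    # Assumption: `delim` matches a whole line and has no capturing groups,
    # since re.split would otherwise add the captured text to its result.
    pieces = re.split(rf"(?m)^(?:{delim})$\n?", text)
    # Drop whatever precedes the first delimiter and follows the last one.
    inner = pieces[1:-1]
    # Remove the single newline separating each block from the next delimiter.
    return [p[:-1] if p.endswith("\n") else p for p in inner]

print(split_blocks("a\n111\nb\n222\nc\n333\nd\n"))  # ['111', '222', '333']
print(split_blocks("== 1 ==\nfoo\nbar\n== 2 ==\nbaz\n== 3 ==\n", delim=r"== \d+ =="))
# ['foo\nbar', 'baz']
```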