I'd like to match (and replace, using a custom replacement function) each block of consecutive lines that all start with foo. This nearly works:

import re

s = """bar6387
bar63287
foo1234
foohelloworld
fooloremipsum
baz
bar
foo236
foo5382
bar
foo879"""

def f(m):
    print(m)

s = re.sub('(foo.*\n)+', f, s)
print(s)
# <re.Match object; span=(17, 53), match='foo1234\nfoohelloworld\nfooloremipsum\n'>
# <re.Match object; span=(61, 76), match='foo236\nfoo5382\n'>

but it fails to recognize the last block, obviously because it is the last line and there is no \n at the end.

Is there a cleaner way to match a block of one or more consecutive lines that start with the same pattern foo?

Basj
  • Try adding a `?` after your newline token to indicate either 0 or 1 of them, thus also matching a block that ends the file: `(foo.*\n?)+` – Adid Jan 31 '22 at 09:34
  • Actually your pattern matches any *foo*, even if it isn't at the start of a line. – Casimir et Hippolyte Jan 31 '22 at 09:37
  • @Adid that will also match a foo that is in the middle of a line – mozway Jan 31 '22 at 09:38
  • @WiktorStribiżew Thanks, but I need more precisely a match/replace (with custom replacement function), and not a `re.split`. I tried your pattern in my `re.sub` but it doesn't work: there are many empty matches (?) – Basj Jan 31 '22 at 09:40
  • @CasimiretHippolyte Oops that's right. How would you add this criteria ("starting by foo")? `^foo` did not seem to work in my case. – Basj Jan 31 '22 at 09:41
  • My apologies, I made a mistake. You could add the option to match either newline or end of string using `((?:^|\n)foo.*(?:$|\n))+` – Adid Jan 31 '22 at 09:41
  • @WiktorStribiżew No, because in this case there is *one match per line*; instead I want one match/one call of replacement function *per whole block*. – Basj Jan 31 '22 at 09:42
  • like that: `(?m)^foo.*(?:\nfoo.*)*\n?` – Casimir et Hippolyte Jan 31 '22 at 09:42
  • It must be just a `re.sub(r'(?m)^foo.*(?:\nfoo.*)*', f, s)` or `re.sub(r'^foo.*(?:\nfoo.*)*', f, s, flags=re.M)` – Wiktor Stribiżew Jan 31 '22 at 09:46
  • Note you do not need `(?=\n|$)` after `.*` because `.` does not match line feed chars, so there is no need to check for the end of a line. Unless you want to consume the newline, then you would use something like `(?:\n|$)`, but it does not look like you need that. – Wiktor Stribiżew Jan 31 '22 at 09:51
  • I'm with @Adid (add `^` anchor & `re.MULTILINE` if needed). Might be sufficient to use: [`^(?:foo.*\n?)+`](https://regex101.com/r/GJw61i/1) – bobble bubble Jan 31 '22 at 12:32
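
The comments above point at two separate problems with `(foo.*\n)+`: it misses a final block that has no trailing newline, and it also matches a foo in the middle of a line. A minimal sketch comparing it with the anchored pattern from the last comment; the sample text and the `blocks` helper are made up for illustration:

```python
import re

# Hypothetical sample: "barfoo" has foo mid-line, and the last line has no trailing newline.
text = "bar\nfoo1\nfoo2\nbarfoo\nfoo3"

def blocks(pattern, flags=0):
    """Return the full text of every match of `pattern` in `text`."""
    return [m.group() for m in re.finditer(pattern, text, flags)]

unanchored = blocks(r'(foo.*\n)+')
anchored = blocks(r'^(?:foo.*\n?)+', re.M)

print(unanchored)  # ['foo1\nfoo2\n', 'foo\n']  (mid-line foo matched, last block missed)
print(anchored)    # ['foo1\nfoo2\n', 'foo3']
```

Note that the anchored variant includes the trailing newline in a block when there is one, which may or may not be what the replacement function expects.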

3 Answers

Here is an re.findall approach:

import re

s = """bar6387
bar63287
foo1234
foohelloworld
fooloremipsum
baz
bar
foo236
foo5382
bar
foo879"""

lines = re.findall(r'^foo.*(?:\nfoo.*(?=\n|$))*', s, flags=re.M)
print(lines)
# ['foo1234\nfoohelloworld\nfooloremipsum',
#  'foo236\nfoo5382',
#  'foo879']

The above regex runs in multiline mode, and says to match:

^                     from the start of a line
foo                   "foo"
.*                    consume the rest of the line
(?:\nfoo.*(?=\n|$))*  match newline and another "foo" line, 0 or more times

Edit:

If you need to replace/remove these blocks, then use the same pattern with re.sub and a lambda callback:

output = re.sub(r'^foo.*(?:\nfoo.*(?=\n|$))*', lambda m: "BLAH", s, flags=re.M)
print(output)

This prints:

bar6387
bar63287
BLAH
baz
bar
BLAH
bar
BLAH
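
If the replacement depends on the content of the block, the lambda can be swapped for a named function; `re.sub` calls it once per match, hence once per whole block. A minimal sketch using the same pattern, where `summarize`, the `calls` list and the shortened sample input are invented for illustration:

```python
import re

s = "bar6387\nfoo1234\nfoohelloworld\nbaz\nfoo879"  # shortened sample input

calls = []  # records every block the callback receives

def summarize(m):
    """Hypothetical replacement: report how many foo lines the block had."""
    block = m.group()
    calls.append(block)
    return f"<{len(block.splitlines())} foo line(s)>"

output = re.sub(r'^foo.*(?:\nfoo.*(?=\n|$))*', summarize, s, flags=re.M)
print(output)
print(calls)  # ['foo1234\nfoohelloworld', 'foo879']
```

The callback runs twice here, once for the two-line block and once for the final single line.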
Tim Biegeleisen
  • Thank you! Note: I need a `re.sub` with a custom replacement function, with one call of the function per *whole* block. Does it work too? (I will try in the meantime) – Basj Jan 31 '22 at 09:43
  • @Basj Check the bottom of my answer for an `re.sub` option. You may place whatever replacement logic you want for each block inside the lambda function. – Tim Biegeleisen Jan 31 '22 at 09:48
  • Note that `(?=\n|$)` is always true since the previous `.*` is greedy. – Casimir et Hippolyte Jan 31 '22 at 09:51

Do you really need a regex? Here is an itertools.groupby based approach:

from itertools import groupby
import re

# dummy example function
f = lambda x: '>>'+x.upper()+'<<'

out= '\n'.join(f(G) if (G:='\n'.join(g)) and k else G
               for k,g in groupby(s.split('\n'), lambda l: l.startswith('foo')))

print(out)

NB: you don't need a regex, but you can use one in the groupby key if that is how you want to define the matching lines:

# using a regex to match the blocks:
out= '\n'.join(f(G) if (G:='\n'.join(g)) and k else G
               for k,g in  groupby(s.split('\n'),
                                   lambda l: bool(re.match('foo', l))
                                   ))

Output:

bar6387
bar63287
>>FOO1234
FOOHELLOWORLD
FOOLOREMIPSUM<<
baz
bar
>>FOO236
FOO5382<<
bar
>>FOO879<<
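
The one-liner is dense; the same idea can be spelled out as a loop. A sketch that assumes lines are split and rejoined on `\n`; the name `replace_blocks`, its parameters and the sample input are illustrative and not part of the answer:

```python
from itertools import groupby

def replace_blocks(text, func, prefix='foo'):
    """Apply `func` once to each run of consecutive lines starting with `prefix`."""
    parts = []
    lines = text.split('\n')
    for is_block, group in groupby(lines, key=lambda line: line.startswith(prefix)):
        chunk = '\n'.join(group)
        # Only runs of matching lines go through the replacement function.
        parts.append(func(chunk) if is_block else chunk)
    return '\n'.join(parts)

s = "bar\nfoo1\nfoo2\nbarfoo\nfoo3"
result = replace_blocks(s, lambda block: '>>' + block.upper() + '<<')
print(result)
```

Because the key uses `startswith`, a line such as `barfoo` is left untouched.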
mozway

You can use either of these equivalent calls:

re.sub(r'(?m)^foo.*(?:\nfoo.*)*', f, s)
re.sub(r'^foo.*(?:\nfoo.*)*', f, s, flags=re.M)

where

  • ^ - matches the start of the string (here, the start of any line, due to the (?m) or re.M option)
  • foo - matches foo
  • .* - zero or more characters other than line break characters, as many as possible
  • (?:\nfoo.*)* - zero or more sequences of a newline, foo and then the rest of the line.

See the Python demo:

import re

s = "bar6387\nbar63287\nfoo1234\nfoohelloworld\nfooloremipsum\nbaz\nbar\nfoo236\nfoo5382\nbar\nfoo879"
def f(m):
    print(m.group().replace('\n', r'\n'))

re.sub(r'(?m)^foo.*(?:\nfoo.*)*', f, s)

Output:

foo1234\nfoohelloworld\nfooloremipsum
foo236\nfoo5382
foo879
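
The demo's `f` only prints each block. A hedged sketch of a callback that returns a replacement string, plus the `\n?` variant from the comments under the question for the case where a block should be deleted together with its line break; the `wrap` function and the short sample input are hypothetical:

```python
import re

s = "bar\nfoo1\nfoo2\nbaz\nfoo3"

def wrap(m):
    """Hypothetical replacement: bracket the whole block."""
    return '[' + m.group() + ']'

wrapped = re.sub(r'(?m)^foo.*(?:\nfoo.*)*', wrap, s)

# A trailing \n? also consumes the block's final line break,
# so deleting a block does not leave an empty line behind.
deleted = re.sub(r'(?m)^foo.*(?:\nfoo.*)*\n?', '', s)

print(wrapped.replace('\n', r'\n'))  # bar\n[foo1\nfoo2]\nbaz\n[foo3]
print(deleted.replace('\n', r'\n'))  # bar\nbaz\n
```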
Wiktor Stribiżew
  • Thanks! What does `(?:...)*` mean exactly in regex? I know `?` as [optional quantifier](https://stackoverflow.com/questions/5583579/question-marks-in-regular-expressions) but what is `?:...`? – Basj Jan 31 '22 at 12:30
  • @Basj A [non-capturing group](https://stackoverflow.com/questions/3512471/what-is-a-non-capturing-group-in-regular-expressions). – Wiktor Stribiżew Jan 31 '22 at 12:31