
I want to remove all lines that include a b in this multiline string:

aba\n
aaa\n
aba\n
aaa\n
aba[\n\n - optional]

Note the file is not necessarily terminated by a newline character, or may have extra line breaks at the end that I want to keep.

This is the expected output:

aaa\n
aaa[\n\n - as in the input file]

This is what I have tried:

import re
String = "aba\naaa\naba\naaa\naba"
print(String)
print(re.sub(".*b.*", "", String))  # this one leaves three empty lines
print(re.sub(".*b.*\n", "", String))  # this one misses the last line
print(re.sub("\n.*b.*", "", String))  # this one misses the first line
print(re.sub(".*b.*\n?", "", String))  # this one leaves an empty last line
print(re.sub("\n?.*b.*", "", String))  # this one leaves an empty first line
print(re.sub("\n?.*b.*\n?", "", String))  # this one joins the two remaining lines

I have also tried flags=re.M and various lookaheads and lookbehinds, but the main question seems to be: how can I remove either the first or the last occurrence of \n in a matching string, depending on which one exists, but not both if both exist?

bers

2 Answers


You may use a regex or a non-regex approach:

import re
s = "aba\naaa\naba\naaa\naba"
print( "\n".join([st for st in s.splitlines() if 'b' not in st]) )
print( re.sub(r'^[^b\r\n]*b.*[\r\n]*', '', s, flags=re.M).strip() )

See the Python demo.

The non-regex approach, "\n".join([st for st in s.splitlines() if 'b' not in st]), splits the string at line breaks, filters out all lines containing b, and then joins the remaining lines back together.
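A minimal sketch of a variant of the split/join idea that keeps trailing line breaks, as the question asks. It assumes "\n" is the only line-break character in the input; the helper name drop_lines and its line_regex parameter are hypothetical, and re.search stands in for the simple 'b' test so that a more complicated per-line pattern can be plugged in.

```python
import re

def drop_lines(text, line_regex):
    """Return text without the lines that match line_regex (hypothetical helper)."""
    # str.split("\n") keeps the empty fields produced by trailing line breaks,
    # so joining the surviving pieces restores those breaks.
    kept = [line for line in text.split("\n") if not re.search(line_regex, line)]
    return "\n".join(kept)

print(repr(drop_lines("aba\naaa\naba\naaa\naba", "b")))      # 'aaa\naaa'
print(repr(drop_lines("aba\naaa\naba\naaa\naba\n\n", "b")))  # 'aaa\naaa\n\n'
```

The difference from splitlines() is that splitlines() does not report an empty final field after a trailing line break, so one trailing "\n" would be lost when joining.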

The regex approach uses a pattern like r'^[^b\r\n]*b.*[\r\n]*':

  • ^ - start of a line
  • [^b\r\n]* - any 0 or more chars other than CR, LF and b
  • b - a b char
  • .* - any 0+ chars other than line break chars
  • [\r\n]* - 0+ CR or LF chars.

Note that you need .strip() to remove the line break this leaves at the end of the string. Keep in mind that .strip() removes all leading and trailing whitespace, including any that was already in the input.
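As a small illustration of what the substitution returns before and after .strip(), here is a sketch using the pattern described in the list above. The second input, with two trailing line breaks, is an assumed test case and not part of the original answer.

```python
import re

pattern = r'^[^b\r\n]*b.*[\r\n]*'

s = "aba\naaa\naba\naaa\naba"
without_strip = re.sub(pattern, '', s, flags=re.M)
print(repr(without_strip))          # 'aaa\naaa\n'
print(repr(without_strip.strip()))  # 'aaa\naaa'

# Assumed extra case: the input ends with two line breaks.
kept_breaks = re.sub(pattern, '', "aba\naaa\n\n", flags=re.M)
print(repr(kept_breaks))            # 'aaa\n\n'
print(repr(kept_breaks.strip()))    # 'aaa'
```

In the second case .strip() also drops the line breaks that were part of the input, which matters if they need to be preserved.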

A single-regex solution is too cumbersome, and I would not advise using it in real life:

rx = r'(?:{0}(?:\n|$))+|(?:\n|^){0}'.format(r'[^b\n]*b.*')
print( re.sub(rx, '', s) )

See Python demo.

The pattern will look like (?:[^b\n]*b.*(?:\n|$))+|(?:\n|^)[^b\n]*b.* and it will match

  • (?:[^b\n]*b.*(?:\n|$))+ - 1 or more repetitions of
    • [^b\n]* - any 0+ chars other than b and a newline
    • b.* - b and the rest of the line (.* matches any 0+ chars other than a newline)
    • (?:\n|$) - a newline or end of string
  • | - or
    • (?:\n|^) - a newline or start of string
    • [^b\n]*b.* - a line with at least one b on it
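The breakdown above can be checked against a few inputs. This is only a sketch: the first sample is the string from the question, and the other two are assumed edge cases (trailing line breaks, and a single matching line).

```python
import re

rx = r'(?:{0}(?:\n|$))+|(?:\n|^){0}'.format(r'[^b\n]*b.*')

samples = [
    "aba\naaa\naba\naaa\naba",  # string from the question
    "aba\naaa\naba\n\n",        # assumed case: extra line breaks at the end
    "aba",                      # assumed case: a single matching line
]
results = [re.sub(rx, '', sample) for sample in samples]
for sample, result in zip(samples, results):
    print(repr(sample), '->', repr(result))
# 'aba\naaa\naba\naaa\naba' -> 'aaa\naaa'
# 'aba\naaa\naba\n\n' -> 'aaa\n\n'
# 'aba' -> ''
```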
Wiktor Stribiżew
  • +1 for the `splitlines()`/`join()` approach. I would of course need to write `'b' not in st` as a regex, as my real code is more complicated than this simple example. But it might work. I would prefer a full regexp solution, however, and your second proposal leaves an empty line at the end: https://ideone.com/tPfKKH – bers Mar 04 '19 at 22:28
  • @bers Add `.strip()`, it will be cleanest with a regex approach. Actually, any removing approach of this kind will involve quite an unreadable regex. – Wiktor Stribiżew Mar 04 '19 at 22:31
  • @bers Some pattern like `r'(?m)(?:[\r\n]+.*b.*)+$|^.*b.*[\r\n]*'` could be used ([demo](https://ideone.com/7Hwx62)), I guess, but even I would not use it in production. – Wiktor Stribiżew Mar 04 '19 at 22:39
  • This last solution removes empty lines before the last one: https://ideone.com/RHLaDr - maybe replace `+` by `?`. Anyway, we have an accepted regex-based answer. – bers Mar 07 '19 at 06:53
  • @bers Just do not remove the line breaks after the line, https://ideone.com/tKEBwa, if you do not want to remove linebreaks. – Wiktor Stribiżew Mar 07 '19 at 07:21
  • https://ideone.com/tKEBwa leaves a blank line at the beginning... I guess whatever we try here regex-wise will converge towards the accepted answer. – bers Mar 07 '19 at 11:49
  • @bers I think you may use a shorter and more efficient regex, and build it dynamically, see https://ideone.com/IQPAVj – Wiktor Stribiżew Mar 07 '19 at 12:06
  • No, that version removes a blank line if all lines are to be replaced: https://ideone.com/Q7UNNq Also, it seems to require some negative of the line-matching pattern (`^b`), which can be pretty difficult if that gets more complicated. – bers Mar 07 '19 at 13:10

There are three cases to take into account in your re.sub() call to remove lines with a b in them:

  1. patterns followed by an end of line character (eol)
  2. the last line in the text (without a trailing eol)
  3. when there is only one line with no trailing eol

In the second case, you want to remove the preceding eol character to avoid creating an empty line. The third case will produce an empty string if there is a "b".

Regular expressions' greediness introduces a fourth case, because matches cannot overlap. If the last line contains a "b" and the line before it also contains a "b", case #1 will have consumed the eol character of the previous line, so that character is no longer available to match the pattern on the last line (i.e., eol followed by the pattern at the end of the text). This can be addressed by clearing consecutive matching lines as a group (case #1) and including the last line as an optional component of that group. Whatever this leaves out will be trailing lines (case #2), where you want to remove the preceding eol rather than the following one.
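To make the fourth case concrete, the following sketch compares a pattern with one independent alternative per case against the grouped form. The input string is an assumed example in which the last two lines both contain "b" and there is no trailing eol; the names naive and grouped are hypothetical.

```python
import re

line = "(.*b.*)"
# One alternative per case, without grouping consecutive matching lines.
naive = "(Line\n)|(\nLine$)|(^Line$)".replace("Line", line)
# Grouped form: consecutive matching lines plus an optional last line.
grouped = "(Line\n)+(Line$)?|(\nLine$)|(^Line$)".replace("Line", line)

text = "aaa\naba\naba"  # assumed input: last two lines contain "b", no trailing eol

naive_result = re.sub(naive, "", text)
grouped_result = re.sub(grouped, "", text)

print(repr(naive_result))     # 'aaa\naba' (the last line survives)
print("b" in grouped_result)  # False
```

With the naive pattern, the eol after the second line is consumed together with that line, so the last line is no longer preceded by an eol that case #2 could match.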

In order to manage repetition of the line pattern .*b.*, you will need to assemble your search pattern from two parts: the line pattern and the list pattern that uses it multiple times. Since we're already deep in regular expressions, why not use re.sub() to do that as well?

import re

LinePattern = "(.*b.*)"
ListPattern = "(Line\n)+(Line$)?|(\nLine$)|(^Line$)" # Case1|Case2|Case3
Pattern     = re.sub("Line",LinePattern,ListPattern)

String  = "aba\naaa\naba\naaa\naba"
cleaned = re.sub(Pattern,"",String)
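If the real line pattern contains backslash escapes, assembling the search pattern needs some care: re.sub() interprets backslashes in its replacement argument, so an escape such as \d in the line pattern can be rejected or altered. A minimal sketch of one way around this, assuming plain str.replace() is acceptable for the assembly step; the helper name remove_matching_lines is hypothetical.

```python
import re

def remove_matching_lines(text, line_pattern):
    """Remove every line matching line_pattern (hypothetical helper)."""
    list_pattern = r"(Line\n)+(Line$)?|(\nLine$)|(^Line$)"  # Case1|Case2|Case3
    # str.replace() inserts line_pattern literally, backslashes included.
    pattern = list_pattern.replace("Line", "(?:" + line_pattern + ")")
    return re.sub(pattern, "", text)

print(repr(remove_matching_lines("aba\naaa\naba\naaa\naba", r".*b.*")))  # 'aaa\naaa'
print(repr(remove_matching_lines("a1\naaa\na2\naaa\na3", r".*\d.*")))    # 'aaa\naaa'
```

Wrapping the line pattern in a non-capturing group keeps alternations inside it from leaking into the surrounding list pattern.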

Note: This technique would also work with a different separator character (e.g. a comma instead of eol), but that character needs to be excluded from the line pattern (e.g. ([^,]*b[^,]*) ).
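A short sketch of the comma-separated variant mentioned in the note, built the same way as the eol version. The sample string and the names ItemPattern, CsvListPattern, and cleaned_csv are assumptions for illustration.

```python
import re

# The separator is excluded from the item pattern, as the note requires.
ItemPattern = "([^,]*b[^,]*)"
CsvListPattern = "(Item,)+(Item$)?|(,Item$)|(^Item$)"  # Case1|Case2|Case3
CsvPattern = re.sub("Item", ItemPattern, CsvListPattern)

csv_string = "aba,aaa,aba,aaa,aba"
cleaned_csv = re.sub(CsvPattern, "", csv_string)
print(cleaned_csv)  # aaa,aaa
```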

Alain T.
  • That's a straightforward idea, thank you! Do you know of a way to avoid repeating the `.*b.*` pattern? My pattern is pretty long and would make that hardly readable. Based on https://stackoverflow.com/q/19794603/880783, I would use `p = ".*b.*", re.sub("(" + p + "\n)|(\n" + p + "$)|(^" + p + "$)", "", String) ` – bers Mar 05 '19 at 08:12
  • I don't know of a way to avoid pattern repetitions in regular expressions (group references such as \1 match the same result; they don't reapply the pattern). By the way, I discovered that the greediness of the expression creates a 4th case that I hadn't thought about: when there are more than 2 lines, the last 2 lines need to be excluded, and there is no eol at the end of the text. In that case, the one before last eats up the eol and the last line is not removed. You could repeat the replacements until it no longer changes the string, but that would be inefficient and defeats the purpose. – Alain T. Mar 05 '19 at 14:24
  • I'll try to find a way to circumvent this issue and get back to you. – Alain T. Mar 05 '19 at 14:28
  • I found a way to do it and to manage the repetition of your more complex pattern. See the updated answer. – Alain T. Mar 05 '19 at 15:50
  • Great answer, thank you! I had not thought about that edge case, either. – bers Mar 06 '19 at 11:22
  • There was another edge case. Fixed it now. – Alain T. Mar 06 '19 at 17:58
  • Forgot to put in Case #3 in that last fix. Added it back now. – Alain T. Mar 06 '19 at 23:40
  • Is my understanding correct that case #2 could also be `\nLine$` (just adding $)? This should not change anything, but make the purpose of that case more obvious to the reader. – bers Mar 07 '19 at 06:35
  • Yes that should work as well because the only situation I saw where case#1 would leave something out for case#2 is when the \nLine is at the end of the string. i.e. line before last not removed when last line needs to be removed. – Alain T. Mar 07 '19 at 07:09