Using sed/awk to remove string from subsections

Question

I have a file that looks like this:

bar
barfo
barfoo
barfooo
barfoooo

sample
sampleText1
sampleText2
sampleText3

prefix
prefixFooBar
prefixBarFoo

What I want sed (or awk) to do is to remove the string which introduces a section, from all of its contents, so that I end up with:

bar
fo
foo
fooo
foooo

sample
Text1
Text2
Text3

prefix
FooBar
BarFoo

I tried using

sed -e -i '/([[:alpha:]]+)/,/^$/ s/\1//g' file

But that fails with "Invalid Backreference".

score 5 · Answer 1 · answered Oct 27 '18 at 20:32

$ awk '{$0=substr($0,idx)} !idx{idx=length($0)+1} !NF{idx=0} 1' file
bar
fo
foo
fooo
foooo

sample
Text1
Text2
Text3

prefix
FooBar
BarFoo

Ed Morton

To be fair: Just like karakfa makes assumptions, you also do. This solution works only if all the paragraph strings start with the subject word (which is consistent with the example, but not with the question text - and so is karakfa's answer). If that's not the case, your awk will not remove *all* the words, not even once, it will remove something else instead. This is handled better by karakfa. – steffen Oct 28 '18 at 11:22
There are reasonable assumptions and then there are unreasonable assumptions. When sample input always includes a generic string like "sample" in a specific position it's reasonable to assume a string "sample" does always appear in that position. It's also reasonable to assume that the poster wants it treated as a string rather than a regular expression and so if "sample" contains RE metachars they should be treated literally unless otherwise stated. It's NOT reasonable to assume that "sample" will always literally be the word "sample" or even that it will always contain non-RE meta-characters – Ed Morton Oct 28 '18 at 12:02
I agree that it should be treated as string, but I disagree on the position and on the number of matches. Why? Because of two indications: (1) The question text itself states it clearly and (2) His `sed` expression contains a `g`. In the past you've been eagerly encouraging users to add hints about possible drawbacks. (In fact that's the reason for the second sentence in my `sed` solution.) So I guess you should add a note about that here. – steffen Oct 28 '18 at 12:13
His sed expression is simply wrong. Please clarify "The question text itself states it clearly" - states what and where? – Ed Morton Oct 28 '18 at 12:35
(1) "remove the string from all of the section's contents". (2) The sed is simply wrong, yes, but i find it quite clever: Apparently he was trying to apply `s/\1//g` for each line of a section. A section starts with `[[:alpha:]]+` and ends with `^$`. Finally he was adding parens in order to be able to reference the section start in the `s` command. Just a few problems: First, the group isn't available in the `s` command any more, second, unlike in vim e. g. he can't `/.../+1,/.../` and third, he had to handle the last line. But you can get some information out of that attempt. – steffen Oct 28 '18 at 12:52
IMHO it's extremely clear that he's simply talking about removing the header string from the start of each line of the section, as evidenced by the sample input he provided. The sed script is just him trying to come up with something that'll do what he wants - all we can determine from it is that it doesn't do what he wants. – Ed Morton Oct 28 '18 at 12:54
OK, I agree, you should not mention it. – steffen Oct 28 '18 at 12:57

score 3 · Answer 2 · answered Oct 27 '18 at 20:38

another awk

$ awk '{sub(pre,"")}1; !NF{pre=""} !pre{pre=$1}' file

bar
fo
foo
fooo
foooo

sample
Text1
Text2
Text3

prefix
FooBar
BarFoo

karakfa

That will fail if the first string in a block contains RE metachars or is numerically equal to zero. – Ed Morton Oct 27 '18 at 20:39
This will also not work if the first string contains whitespace. – steffen Oct 28 '18 at 11:24
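
Both failure modes raised in these comments (regex metacharacters and a header equal to zero) go away if the header is compared as a literal string and tracked with an explicit flag. The following is only a sketch under that assumption; the function name `strip_prefix_literal` and the variables `hdr` and `have` are made up for illustration and do not come from any of the answers. It removes the header only when it appears at the very start of a line.

```shell
# Hypothetical helper: strip each section's header from the start of the
# lines that follow it, comparing literally (no regex involved).
strip_prefix_literal() {
  awk '
    !NF   { hdr = ""; have = 0; print; next }   # blank line ends the section
    !have { hdr = $0; have = 1; print; next }   # first line of a section is the header
    index($0, hdr) == 1 { $0 = substr($0, length(hdr) + 1) }  # literal prefix only
    { print }
  ' "$@"
}

# Headers "a.b" and "0" would trip up a regex- or truthiness-based script.
printf 'a.b\na.bX\naxbY\n\n0\n01\n' | strip_prefix_literal
```

Because the whole header line is stored in `hdr` (not `$1`), a header containing whitespace is also handled.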

ikegami · Accepted Answer · 2018-10-28T02:12:21.443

perl -ple'
   if (!length($_)) { $re = "" }
   elsif (!length($re)) { $re = $_ }
   else { s/^\Q$re// }
'

Notes:

Use s/\Q$re//g to remove anywhere in the line instead of just removing the prefix.
This works even when the header line includes special characters such as \, . and *.
This works even if there are multiple blank lines in a row.
See Specifying file to process to Perl one-liner for complete usage.
The line breaks in the code are optional (i.e. can be removed).

score 1 · Answer 4 · answered Oct 27 '18 at 21:33

A sed solution, mostly to illustrate that sed is probably not the best choice to do this:

$ sed -E '1{h;b};/^$/{n;h;b};G;s/^(.*)(.*)\n\1$/\2/' infile
bar
fo
foo
fooo
foooo

sample
Text1
Text2
Text3

prefix
FooBar
BarFoo

Here is how it works:

1 {                   # on the first line
  h                   # copy pattern buffer to hold buffer
  b                   # skip to end of cycle
}
/^$/ {                # if line is empty
  n                   # get next line into pattern buffer
  h                   # copy pattern buffer to hold buffer
  b                   # skip to end of cycle
}
G                     # append hold buffer to pattern buffer
s/^(.*)(.*)\n\1$/\2/  # substitute

The complex part is in the substitution. Before the substitution, the pattern buffer holds something like this:

prefixFooBar\nprefix

The substitution now matches two capture groups, the first of which is referenced by what's between \n and the end of the string – the prefix we fetched from the hold buffer.

The replacement is then the rest of the original line, with the prefix removed.
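
To see the substitution in isolation, the two-line pattern buffer can be built by hand. This is a minimal sketch, assuming a sed that accepts `-E` and `\n` in the pattern (GNU sed does); the helper name `demo_substitution` is invented for illustration, and `N` is used here only to join the two input lines into the same shape that `G` produces in the real script.

```shell
# Hypothetical helper: $1 is a section line, $2 is the section header.
demo_substitution() {
  # N joins both input lines into one pattern buffer: "<line>\n<header>"
  printf '%s\n%s\n' "$1" "$2" | sed -E 'N;s/^(.*)(.*)\n\1$/\2/'
}

demo_substitution prefixFooBar prefix
```

The first group starts out greedy and backtracks until what it captured equals the text after the newline, which leaves the remainder of the line in the second group.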

Remarks:

This works with GNU sed; older GNU sed versions might need -r instead of -E
-E is just for convenience; without it, the substitution would look like
```
s/^\(.*\)\(.*\)\n\1$/\2/
```
but still work.
For macOS sed, it works with literal linebreaks between commands:
```
sed -E '1{
h
b
}
/^$/{
n
h
b
}
G
s/^(.*)(.*)\n\1$/\2/' infile
```

steffen · Answer 5 · 2018-10-27T22:43:13.510

Here's another sed solution. It works only if all strings in a paragraph start with the subject line.

sed -e '1{h;b};/^$/{n;h;b};H;g;s/\(.*\)\n\1//;p;g;s/\n.*//;h;d' file

1 first line: h copy to hold space, b print and continue with next line
/^$/ empty lines: n print it and read next line, h copy to hold space, b print and continue
all (the other) lines:
- H append to hold space with newline
- g copy hold space to pattern space
- s/\(.*\)\n\1// remove the first line and its contents in the second line from pattern space
- p print pattern space
- g copy hold space to pattern space in order to remove the new contents from H
- s/\n.*// remove the new contents
- h copy back to hold space
- d delete pattern space

sed is not useful for these things.

You get 'Invalid back reference' because there's no group in the search pattern of s.

score 1 · Answer 6 · answered Oct 28 '18 at 10:30

Another in awk:

$ awk '{if(p&&match($0,"^" p))$0=substr($0,RLENGTH+1);else p=$0}1' file

Output:

bar
fo
foo
fooo
foooo

sample
Text1
Text2
Text3

prefix
FooBar
BarFoo

James Brown

steffen · Answer 7 · 2018-10-28T12:16:29.503

Here's another awk solution:

awk '{gsub(s,"")}1; s==""||!NF{s=$0}' file

Pros:

Matches are replaced, wherever they are
All matches are replaced
Head line may evaluate to 0/false
Head line may contain whitespace

Cons:

Head line must not contain regular expression meta chars
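
The one listed con can be avoided by cutting the header out with `index()` and `substr()` instead of `gsub()`, at the cost of a longer script. This is a sketch only, not part of the answer above; the function name `strip_all_literal` and the variables `hdr`, `have`, `out` and `rest` are hypothetical. It removes every literal occurrence of the header, wherever it appears in the line.

```shell
# Hypothetical helper: remove all literal occurrences of each section header
# from the lines of that section.
strip_all_literal() {
  awk '
    !NF   { hdr = ""; have = 0; print; next }   # blank line ends the section
    !have { hdr = $0; have = 1; print; next }   # first line of a section is the header
    {
      out = ""
      rest = $0
      # Repeatedly cut the literal header out of the remaining text.
      while ((i = index(rest, hdr)) > 0) {
        out = out substr(rest, 1, i - 1)
        rest = substr(rest, i + length(hdr))
      }
      print out rest
    }
  ' "$@"
}

printf 'a.b\nXa.bYa.b\naxb\n' | strip_all_literal
```

The header is never empty when the loop runs, since it is only set from a line with at least one field, so the loop always terminates.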

edited Oct 28 '18 at 12:16

answered Oct 28 '18 at 11:31

score 1 · Answer 8 · answered Oct 28 '18 at 14:29

This might work for you (GNU sed):

sed 'G;s/^\(.\+\)\(.*\)\n\1$/\2/;t;s/\n.*//;h' file

Append the previous key (or nothing if it is the first line) to the current line. Remove the key and the previous key if they match, print the current line and repeat. Otherwise the key did not match, remove the old appended key, store the new key in the hold space and print the new key.
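
The same script can be laid out one command per line with comments, which may make the description easier to follow. This is a sketch that assumes GNU sed (the `\+` in the pattern is a GNU extension); the wrapper name `strip_prefix_gnused` is made up for illustration.

```shell
# Hypothetical wrapper; the sed commands are the ones from the one-liner above.
strip_prefix_gnused() {
  sed '
    # Append the hold space (the current key, initially empty) after a newline.
    G
    # If the line starts with the non-empty key, keep only what follows it.
    s/^\(.\+\)\(.*\)\n\1$/\2/
    # On a successful substitution, end the cycle (the line is printed).
    t
    # Otherwise drop the appended key ...
    s/\n.*//
    # ... and remember this line as the new key.
    h
  ' "$@"
}

printf 'bar\nbarfo\n\nsample\nsampleText1\n' | strip_prefix_gnused
```

A blank line fails the substitution as well, so it becomes the new (empty) key, and the line after it is then stored as the key of the next section.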
