Extract substrings between strings

Question

I have a file with text as follows:

###interest1 moreinterest1### sometext ###interest2###
not-interesting-line
sometext ###interest3###
sometext ###interest4### sometext othertext ###interest5### sometext ###interest6###

I want to extract all strings between the ### markers.

My desired output would be something like this:

interest1 moreinterest1
interest2
interest3
interest4
interest5
interest6

I have tried the following:

grep '###' file.txt | sed -e 's/.*###\(.*\)###.*/\1/g'

This almost works, but it seems to grab only the first instance per line, so for the first line of input my output contains only

interest1 moreinterest1

rather than

interest1 moreinterest1
interest2

anubhava · Accepted Answer · 2021-06-24T15:27:17.230

2

Here is a single awk command that does this. It uses ### as the field separator and prints each even-numbered field:

awk -F '###' '{for (i=2; i<NF; i+=2) print $i}' file

interest1 moreinterest1
interest2
interest3
interest4
interest5
interest6
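
To see why the even-numbered fields are the ones of interest, it can help to print every field with its index. The following is only a sketch: the temporary file and the variable names `sample`, `fields`, and `extracted` are assumptions introduced here for illustration, and the input is the sample from the question.

```shell
# Sample input copied from the question, written to a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
###interest1 moreinterest1### sometext ###interest2###
not-interesting-line
sometext ###interest3###
sometext ###interest4### sometext othertext ###interest5### sometext ###interest6###
EOF

# Show how awk numbers the fields of the first line when FS is '###'.
fields=$(awk -F '###' 'NR == 1 { for (i = 1; i <= NF; i++) printf "%d:[%s]\n", i, $i }' "$sample")
printf '%s\n' "$fields"

# The command from above: even-numbered fields that still have a closing marker after them.
extracted=$(awk -F '###' '{ for (i = 2; i < NF; i += 2) print $i }' "$sample")
printf '%s\n' "$extracted"

rm -f "$sample"
```

For the first line this shows five fields: an empty field 1, `interest1 moreinterest1` as field 2, ` sometext ` as field 3, `interest2` as field 4, and an empty field 5. The loop condition `i<NF` (rather than `i<=NF`) means an even-numbered field is printed only when another field follows it, that is, when a closing ### was actually present.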

Here is an alternative grep + sed solution:

grep -oE '###[^#]*###' file | sed -E 's/^###|###$//g'

This assumes there are no # characters in between ### markers.
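
A quick way to check whether that assumption matters for your data is to run both commands on a line where the region of interest contains a single #. The input line and the variable names below are made up for illustration.

```shell
# Hypothetical input: the region of interest itself contains a '#'.
line='sometext ###issue #42### sometext'

# grep + sed approach: [^#]* cannot cross the inner '#', so nothing matches.
grep_result=$(printf '%s\n' "$line" | grep -oE '###[^#]*###' | sed -E 's/^###|###$//g')

# awk approach: splits only on the full ### marker.
awk_result=$(printf '%s\n' "$line" | awk -F '###' '{for (i=2; i<NF; i+=2) print $i}')

printf 'grep+sed: [%s]\nawk:      [%s]\n' "$grep_result" "$awk_result"
```

Here the grep + sed pipeline prints nothing, while the awk command prints `issue #42`.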

edited Jun 24 '21 at 15:27

answered Jun 24 '21 at 14:54

anubhava


Thanks @anubhava, I made some edits to my example. Your solution almost works, but some of my strings between the `###` have spaces in them, which your solutions don't seem to accommodate. Any additional help would be greatly appreciated. – Digsby Jun 24 '21 at 15:03

The awk solution works, thanks! In my real file, I have some other text within regions of interest besides space characters, so that may be the reason why the grep/sed solution still isn't quite what I want. Thanks again for your help! – Digsby Jun 24 '21 at 15:12
Indeed, `awk` is a more robust solution than `grep + sed`. I will move it up. – anubhava Jun 24 '21 at 15:13

score 2 · Answer 2 · answered Jun 24 '21 at 16:28


With GNU awk for multi-char RS:

$ awk -v RS='###' '!(NR%2)' file
interest1 moreinterest1
interest2
interest3
interest4
interest5
interest6
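
One point that may be worth checking before relying on this: with RS set to ###, records are counted across the whole file rather than per line, so the result depends on the markers being balanced everywhere. The sketch below uses a made-up three-line input with a stray marker on line 2 and compares it against per-line field splitting with `-F '###'`. The variable names are assumptions, and the exact output of the RS variant depends on the awk in use.

```shell
# Hypothetical input: line 2 has a stray, unbalanced ### marker.
input=$(printf 'a ###x### b\nstray ### marker\nc ###y###')

# Per-line field splitting: the even/odd count restarts on every line.
per_line=$(printf '%s\n' "$input" | awk -F '###' '{ for (i = 2; i < NF; i += 2) print $i }')

# Whole-file record splitting (this answer): the count carries across lines.
per_file=$(printf '%s\n' "$input" | awk -v RS='###' '!(NR%2)')

printf 'per line:\n%s\n' "$per_line"
printf 'per file:\n%s\n' "$per_file"
```

The per-line variant prints `x` and `y`. The record-based variant should diverge after the stray marker, because every later record shifts by one position. On the other hand, the record-based approach can pick up a region that spans a line break, which per-line splitting cannot.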

answered Jun 24 '21 at 16:28

Ed Morton


score 1 · Answer 3 · answered Jun 24 '21 at 15:13

You can use pcregrep:

pcregrep -o1 '###(.*?)###' file

The regex ###(.*?)### matches ###, then captures into Group 1 zero or more characters other than line breaks, as few as possible, and then matches the closing ###.

The -o1 option outputs only the Group 1 value.


score 1 · Answer 4 · 2021-06-24T17:48:40.923


sed 't x
s/###/\
/;D; :x
s//\
/;t y
D;:y
P;D' file

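The first "###" on a line is replaced with a newline, and D discards the text before it. If a second replacement of "###" then succeeds, the script branches to P, which prints the text between the two markers; otherwise the rest of the line is deleted.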

edited Jun 24 '21 at 17:48

answered Jun 24 '21 at 16:05

score 0 · Answer 5 · answered Jun 25 '21 at 12:59

This might work for you (GNU sed):

sed -n 's/###/\n/g;/[^\n]*\n/{s///;P;D}' file

Replace every occurrence of ### with a newline.

If the pattern space then contains a newline, remove everything up to and including the first newline, print the text up to the next newline, delete that text, and repeat.
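
To make the intermediate state visible, the `l` command can print the pattern space right after the first substitution. This is a sketch assuming GNU sed, as in the answer; the one-line input and the variable names are made up for illustration.

```shell
# Hypothetical one-line input with two marked regions.
line='a ###x### b ###y###'

# Step 1 only: replace the markers, then use 'l' to show the pattern space
# unambiguously (embedded newlines appear as \n, the end of the space as $).
after_subst=$(printf '%s\n' "$line" | sed -n 's/###/\n/g;l')

# The full command from the answer.
result=$(printf '%s\n' "$line" | sed -n 's/###/\n/g;/[^\n]*\n/{s///;P;D}')

printf '%s\n' "$after_subst" "$result"
```

With GNU sed the first command shows `a \nx\n b \ny\n$`, so the regions of interest sit after every second newline. The full command then strips `a \n`, prints `x`, restarts on ` b \ny\n`, strips ` b \n`, and prints `y`.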
