I have a file with text as follows:

###interest1 moreinterest1### sometext ###interest2###
not-interesting-line
sometext ###interest3###
sometext ###interest4### sometext othertext ###interest5### sometext ###interest6###

I want to extract all strings between ### .

My desired output would be something like this:

interest1 moreinterest1
interest2
interest3
interest4
interest5
interest6

I have tried the following:

grep '###' file.txt | sed -e 's/.*###\(.*\)###.*/\1/g'

This almost works but only seems to grab the first instance per line, so the first line in my output only grabs

interest1 moreinterest1

rather than

interest1 moreinterest1
interest2
anubhava
Digsby

5 Answers

2

Here is a single awk command that uses ### as the field separator and prints each even-numbered field:

awk -F '###' '{for (i=2; i<NF; i+=2) print $i}' file

interest1 moreinterest1
interest2
interest3
interest4
interest5
interest6
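
To see why the even-numbered fields are the ones of interest, here is a small sketch that prints every field with its index. The sample line comes from the question; the `show_fields` function name is made up for illustration.

```shell
# Hypothetical helper: print each field of every input line with its index,
# using ### as the field separator.
show_fields() {
    awk -F '###' '{ for (i = 1; i <= NF; i++) printf "%d:[%s]\n", i, $i }'
}

printf '%s\n' '###interest1 moreinterest1### sometext ###interest2###' | show_fields
```

Fields 2 and 4 hold the text between markers, while the odd-numbered fields hold whatever lies outside them. The loop condition `i<NF` skips the last field, which is never followed by a closing ###.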

Here is an alternative grep + sed solution:

grep -oE '###[^#]*###' file | sed -E 's/^###|###$//g'

This assumes there are no # characters in between ### markers.
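
The effect of that assumption can be checked with a short sketch. The input line is invented for illustration (it is not from the question) and puts a single # inside a marked region; the two wrapper function names are also hypothetical.

```shell
# Hypothetical wrappers around the two commands from this answer.
extract_grep_sed() { grep -oE '###[^#]*###' | sed -E 's/^###|###$//g'; }
extract_awk()      { awk -F '###' '{for (i=2; i<NF; i+=2) print $i}'; }

# Made-up input: the marked region contains a lone '#'.
line='text ###issue #42 fixed### more'

# [^#]* cannot cross the inner '#', so grep finds no match here.
printf 'grep+sed: [%s]\n' "$(printf '%s\n' "$line" | extract_grep_sed)"

# awk splits only on the full ### separator, so the inner '#' survives.
printf 'awk:      [%s]\n' "$(printf '%s\n' "$line" | extract_awk)"
```

With this input the grep + sed pipeline should print an empty result, while the awk command should print the whole region including the inner #.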

anubhava
  • Thanks @anubhava, I made some edits to my example. Your solution almost works, but some of my strings between the `###` have spaces in them, which your solutions don't seem to accommodate. Any additional help would be greatly appreciated. – Digsby Jun 24 '21 at 15:03
  • The awk solution works, thanks! In my real file, I have some other text within regions of interest besides space characters, so that may be the reason why the grep/sed solution still isn't quite what I want. Thanks again for your help! – Digsby Jun 24 '21 at 15:12
  • Indeed, `awk` is a more robust solution than `grep + sed`. I will move it up. – anubhava Jun 24 '21 at 15:13
2

With GNU awk for multi-char RS:

$ awk -v RS='###' '!(NR%2)' file
interest1 moreinterest1
interest2
interest3
interest4
interest5
interest6
Ed Morton
1

You can use pcregrep:

pcregrep -o1 '###(.*?)###' file

The regex - ###(.*?)### - matches ###, then captures into Group 1 any zero or more characters other than line break characters, as few as possible, and then matches the closing ###.

The -o1 option outputs the Group 1 value only.

See the regex demo online.

Wiktor Stribiżew
1

sed 't x
s/###/\
/;D; :x
s//\
/;t y
D;:y
P;D' file

This replaces the first "###" with a newline and uses D to discard the text before it. If a second replacement of "###" then succeeds, the script branches to P, which prints the extracted text.

0

This might work for you (GNU sed):

sed -n 's/###/\n/g;/[^\n]*\n/{s///;P;D}' file

Replace every occurrence of ### with a newline.

If a line contains a newline, remove any characters before and including the first newline, print the details up to and including the following newline, delete those details and repeat.

potong