Find and append to Text Between Two Strings or Words using sed or awk

Question

I am looking for a sed command that can recognize all of the text between two indicators and then append a placeholder affix to each word in it.

For instance, the 1st indicator is a list of words:

(no|noone|haven't)

and the 2nd indicator is a list of punctuation marks:

(.|,|!)

From an input text such as

"Noone understands the plot. There is no storyline. I haven't recommended this movie to my friends! Did you understand it?"

The desired result would be:

"Noone understands_AFFIX me_AFFIX. There is no storyline_AFFIX. I haven't recommended_AFFIX this_AFFIX movie_AFFIX to_AFFIX my_AFFIX friends_AFFIX! Did you understand it?"

I know that there is the following sed:

sed -n '/WORD1/,/WORD2/p' /path/to/file

which recognizes the content between two indicators. I have also found a lot of great information and resources here. However, I still cannot find a way to append the affix to each token of text that occurs between the two indicators.

I have also considered using awk, such as

awk '{sub(/.*indic1 /,"");sub(/ indic2.*/,"");print;}' < infile

but it still does not allow me to append the affix.

Does anyone have a suggestion to do so, either with awk or sed?

A. The sed syntax you're relying on uses lines to delimit the range, so it won't work on text that is on the same line. B. You'll do better searching one word at a time in `awk` and managing the output there. C. Your example seems inconsistent. Applying your "rule" to `Noone understands the plot` produced `Noone understands_AFFIX me_AFFIX.`, whereas `There is no storyline.` produced `There is no storyline_AFFIX`. How did `AFFIX` get inserted twice in the first and only once in the second? Good luck. – shellter, Dec 09 '15 at 09:47
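To illustrate point A, here is a minimal sketch with a hypothetical five-line input (the sample lines and the `demo_range` function name are made up for this example). The `/WORD1/,/WORD2/` address pair selects whole lines, and sed only starts looking for the end pattern on the line after the one that opened the range:

```shell
# Hypothetical input: WORD1 and WORD2 both occur on the second line.
demo_range() {
  printf '%s\n' 'alpha' 'x WORD1 y WORD2 z' 'beta' 'WORD2 here' 'gamma' |
    sed -n '/WORD1/,/WORD2/p'
}

demo_range
```

This prints the three complete lines from `x WORD1 y WORD2 z` through `WORD2 here`, not the text between the two words, which is why a range address cannot isolate a span inside a single line.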

choroba · Answer 1 · 2015-12-09T10:01:40.013

1

Perl to the rescue!

perl -pe 's/(?:no(?:one)?|haven'\''t)\s*\K([^.,!]+)/
            join " ", map "${_}_AFFIX", split " ", $1/egi
         ' infile > outfile

\K keeps everything matched to its left out of the text being replaced. In this case, the 1st indicator must be present but is left untouched. (\K needs Perl 5.10+.)
/e evaluates the replacement part as code. In this case, the code splits $1 on whitespace, map appends _AFFIX to each of the members, and join joins them back into a string with single spaces. /g applies this to every match and /i makes the match case-insensitive.

edited Dec 09 '15 at 10:01

answered Dec 09 '15 at 09:53

choroba


I didn't even think of using `perl` for this problem. Just one question. Regarding the indicators, do they match only whole words (i.e. like -w), or also when the indicator is part of a longer word? – owwoow14 Dec 09 '15 at 09:57
@owwoow14: As written here, they match as a substring, too. You can add word boundaries `\b` if needed. – choroba Dec 09 '15 at 10:02

score 1 · Answer 2 · answered Dec 09 '15 at 10:13

Here is a more verbose awk command that does the same (gawk is needed for IGNORECASE and \y):

s="Noone understands the plot. There is no storyline. I haven't recommended this movie to my friends! Did you understand it?"

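awk -v IGNORECASE=1 -v kw="no|noone|haven't" -v pct='\\.|,|!' '{
   a = 0; p = ""          # a: inside a span, p: punctuation cut off the current word
   for (i=2; i<=NF; i++) {
      # the parentheses make \y apply to every alternative, not only the outer two
      if ($(i-1) ~ "\\y(" kw ")\\y")
         a=1
      # likewise, "(" pct ")$" anchors every punctuation mark at the end of the word
      if (a && $i ~ "(" pct ")$") {
         p = substr($i, length($i), 1)
         $i = substr($i, 1, length($i)-1)
      }
      if (a)
         $i=$i "_AFFIX" p
      if(p) {              # punctuation closes the span
         p=""
         a=0
      }
   }
} 1' <<< "$s"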

Output:

Noone understands_AFFIX the_AFFIX plot_AFFIX. There is no storyline_AFFIX. I haven't recommended_AFFIX this_AFFIX movie_AFFIX to_AFFIX my_AFFIX friends_AFFIX! Did you understand it?

score 1 · Accepted Answer · answered Dec 09 '15 at 15:36

A slightly more compact awk (gawk, because of gensub):

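$ awk 'BEGIN { RS = ORS = " "; s = "_AFFIX" }
       # last word of a span: put the affix in front of the punctuation mark
       f && /[.,!]$/         { $0 = gensub(/(.)$/, s "\\1", "g"); f = 0 }
       f                     { $0 = $0 s }
       # note: this also matches substrings such as the "no" in "know"
       /Noone|no|haven'\''t/ { f = 1 }
       1' story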

Noone understands_AFFIX the_AFFIX plot_AFFIX. There is no storyline_AFFIX. I haven't recommended_AFFIX this_AFFIX movie_AFFIX to_AFFIX my_AFFIX friends_AFFIX! Did you understand it?
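The one-liner above depends on `RS=" "`, which makes awk read one word per record instead of one line per record. A small sketch of just that part, using a hypothetical helper named `show_records` that numbers the records awk sees:

```shell
# With RS set to a single space, every space-separated word is a record.
# The last record still carries the trailing newline, so strip it first.
show_records() {
  awk 'BEGIN { RS = " " } { sub(/\n$/, ""); print NR ": " $0 }'
}

printf 'There is no storyline.\n' | show_records
```

This prints four numbered records, one per word. Setting `ORS=" "` as well, as the answer does, writes each record back out followed by a space, so the sentence is reassembled on output.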
