
I am looking for a sed command that can recognize all of the text between two indicators and then tag each word in it with a placeholder affix.

For instance, the first indicator is a list of words:

(no|noone|haven't)

and the second indicator is a list of punctuation marks:

(.|,|!)

From an input text such as

"Noone understands the plot. There is no storyline. I haven't recommended this movie to my friends! Did you understand it?"

The desired result would be:

"Noone understands_AFFIX me_AFFIX. There is no storyline_AFFIX. I haven't recommended_AFFIX this_AFFIX movie_AFFIX to_AFFIX my_AFFIX friends_AFFIX! Did you understand it?"

I know that there is the following sed command:

sed -n '/WORD1/,/WORD2/p' /path/to/file

which recognizes the content between two indicators. I have also found a lot of great information and resources here. However, I still cannot find a way to append the affix to each token of text that occurs between the two indicators.

I have also considered using awk, such as

awk '{sub(/.*indic1 /,"");sub(/ indic2.*/,"");print;}' < infile

but it still does not let me append the affix.

Does anyone have a suggestion for how to do this, with either awk or sed?

owwoow14
    A. The sed syntax you're relying on uses lines to delimit the range, so it won't work on text within a single line. B. You'll do better searching one word at a time in `awk` and managing the output there. C. Your example seems inconsistent. Applying your "rule" to `Noone understands the plot` produced `Noone understands_AFFIX me_AFFIX.`, whereas `There is no storyline.` produced `There is no storyline_AFFIX`. How did `AFFIX` get inserted twice in the first and only once in the second? Good luck. – shellter Dec 09 '15 at 09:47
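
To illustrate point A of the comment above, here is a minimal sketch with made-up sample lines (the function name `demo_range` and the input are hypothetical). It suggests that a sed address range selects whole lines, so text outside the two markers on those lines is printed as well:

```shell
# Hypothetical four-line sample; WORD1 and WORD2 sit on different lines.
demo_range() {
  printf '%s\n' 'before' 'a WORD1 b' 'c WORD2 d' 'after' |
    sed -n '/WORD1/,/WORD2/p'   # prints every line from the WORD1 line to the WORD2 line
}

demo_range
# prints:
# a WORD1 b
# c WORD2 d
```

The leading `a` and trailing `d` are kept even though they fall outside the markers, which is why the range form cannot isolate a span inside one line.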

3 Answers


Perl to the rescue!

perl -pe 's/(?:no(?:one)?|haven'\''t)\s*\K([^.,!]+)/
            join " ", map "${_}_AFFIX", split " ", $1/egi
         ' infile > outfile
  • \K keeps whatever was matched to its left out of the replaced text. In this case, it lets the pattern check for the 1st indicator without replacing it. (\K needs Perl 5.10+.)
  • /e evaluates the replacement part as code. In this case, the code splits $1 on whitespace, map adds _AFFIX to each of the members, and join joins them back into a string.
choroba
  • I didn't even think of using `perl` for this problem. Just one question. In regards to the indicators, are they searching for real words (i.e. -w) or just if they find the part of the indicator word in a word? – owwoow14 Dec 09 '15 at 09:57
  • @owwoow14: As written here, they match as a substring, too. You can add word boundaries `\b` if needed. – choroba Dec 09 '15 at 10:02

Here is a more verbose awk command for the same task (it needs GNU awk for IGNORECASE and \y):

s="Noone understands the plot. There is no storyline. I haven't recommended this movie to my friends! Did you understand it?"

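awk -v IGNORECASE=1 -v kw="no|noone|haven't" -v pct='\\.|,|!' '{
   a=0                                   # a=1 while inside a span to be tagged
   for (i=2; i<=NF; i++) {
      # parentheses make \y apply to every keyword, not only the first and last
      if ($(i-1) ~ "\\y(" kw ")\\y")
         a=1
      # parentheses make $ apply to every punctuation mark in pct
      if (a && $i ~ "(" pct ")$") {
         p = substr($i, length($i), 1)   # remember the closing punctuation
         $i = substr($i, 1, length($i)-1)
      }
      if (a)
         $i=$i "_AFFIX" p
      if(p) {                            # punctuation seen: close the span
         p=""
         a=0
      }
   }
} 1' <<< "$s"   # feed the sample string (bash here-string)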

Output:

Noone understands_AFFIX the_AFFIX plot_AFFIX. There is no storyline_AFFIX. I haven't recommended_AFFIX this_AFFIX movie_AFFIX to_AFFIX my_AFFIX friends_AFFIX! Did you understand it?

anubhava

A slightly more compact awk (GNU awk, because of gensub):

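$ awk 'BEGIN{RS=ORS=" "; s="_AFFIX"}
       # closing word of a span: put the affix before the trailing punctuation
       f && /[.,!]$/ {f=0; $0=gensub(/(.)$/, s "\\1", "g")}
       f {$0=$0 s}
       # whole-word match, so "know" or "not" cannot open a span
       /^(Noone|no|haven'\''t)$/ {f=1} 1' story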

Noone understands_AFFIX the_AFFIX plot_AFFIX. There is no storyline_AFFIX. I haven't recommended_AFFIX this_AFFIX movie_AFFIX to_AFFIX my_AFFIX friends_AFFIX! Did you understand it?

karakfa