bash script. print one word before and after match

Question

Please help. I'm struggling to print one word before and one word after each match on a line. Ideally the number of words would be variable, but at least 1 is needed.

Sample input

https://suttacentral.net/sn45.78-82 1 Saṁyutta Nikāya 45.78–82 8. Dutiyaekadhammapeyyālavagga Sīlasampadādisuttapañcaka
  
sn45.78-82 yathayidaṁ, bīṁāṅñhikkhave, chandasampadā …pe…                                                              

https://suttacentral.net/sn45.8 4 Saṁyutta Nikāya 45.8 1. Avijjāvagga Vibhaṅgasutta
  
sn45.8 Idha, bhikkhave, bhikkhu anuppannānaṁ pāpakānaṁ akusalānaṁ dhammānaṁ īṁāṅñanuppādāya chandaṁ janetīṁāṅñi vāyamati vīriyaṁ ārabhati cittaṁ paggaṇhāti padahati,

Expected output

bīṁāṅñhikkhave, chandasampadā …pe… 
īṁāṅñanuppādāya chandaṁ janetīṁāṅñi

I don't know how to deal with characters like ī, ṁ, ā, ṅ, ñ, etc.

Word-related regexes don't handle these characters properly.

What I use

pattern=chand
 grep -oP '(?:\s*\D?\s*){0,'10'}'"$pattern"'(?:\s*\D?\s*){0,'10'}'

What I get

ve, chandasampadā …pe…
▒ya chandaṁ janeti

Please advise a solution. grep, sed, awk, or whatever else is available on a default CentOS install is fine (I can't install other utilities).

If your target word can appear multiple times on 1 line then you should include that case in your example. In particular include the case where 2 matching words are contiguous. Also include the cases where you want to print N words before/after the target but there aren't N words present in the input. — Ed Morton, Sep 05 '22 at 18:41
It's dangerous to use the word `pattern` in the context of pattern matching as it doesn't force you to think about what kind of pattern matching you actually want, string or regexp, and so can lead to insidious bugs when your "pattern" changes. See [how-do-i-find-the-text-that-matches-a-pattern](https://stackoverflow.com/questions/65621325/how-do-i-find-the-text-that-matches-a-pattern). — Ed Morton, Sep 05 '22 at 18:49

Ed Morton · Answer 1 · 2022-09-05T18:51:22.763

Assuming that, as in the example you provided, the target word only appears once per input line or is separated by at least 2*num words from other occurrences of it:

$ regexp=chand

$ num=1

$ grep -Eo "(\S+\s+){,$num}\S*$regexp\S*(\s+\S+){,$num}" file
bīṁāṅñhikkhave, chandasampadā …pe…
īṁāṅñanuppādāya chandaṁ janetīṁāṅñi

$ num=2

$ grep -Eo "(\S+\s+){,$num}\S*$regexp\S*(\s+\S+){,$num}" file
yathayidaṁ, bīṁāṅñhikkhave, chandasampadā …pe…
dhammānaṁ īṁāṅñanuppādāya chandaṁ janetīṁāṅñi vāyamati

The above uses GNU grep for -o and \s/\S and assumes you want to do regexp matching as you're doing in the question rather than string matching.
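
As a rough sketch, the grep command above can be wrapped in a shell function so the word count becomes a parameter. The name words_around is hypothetical and not part of the original answer; {0,$num} is written in place of {,$num} only to spell out the lower bound, and GNU grep is still assumed.

```shell
#!/usr/bin/env bash
# Hypothetical helper (name is an assumption, not from the answer):
# print up to $2 whitespace-delimited words on each side of every word
# that contains a match of the regexp $1. Remaining arguments are files;
# with no files it reads stdin. Requires GNU grep for -o, \s and \S.
words_around() {
    local regexp=$1 num=$2
    shift 2
    grep -Eo "(\S+\s+){0,$num}\S*$regexp\S*(\s+\S+){0,$num}" "$@"
}

# Example call: one word of context around the word containing "chand".
printf '%s\n' 'dhammānaṁ īṁāṅñanuppādāya chandaṁ janetīṁāṅñi vāyamati' |
    words_around chand 1
```

The same limitation applies as stated above: for the input `a chand1 b chand2 c` with num=1, the second match is reported as `chand2 c`, because `b` was already consumed by the first match.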

markp-fuso · Accepted Answer · 2022-09-05T19:28:09.540

Assumptions:

- words are delimited by white space
- there could be more than one match on a line (including overlaps) and if so each set of output is to be printed on a new line

Adding the following lines to the end of OP's input file:

$ tail -2 file
chand1 chand2 chand3                         # multiple matches, overlaps, start/end of line matches
abc def ghi chand1 jkl mno chand2 pdq        # multiple matches, overlaps

One awk idea based on a hard-coded before/after count of 1:

awk -v ptn='chand' '
{ for (i=1;i<=NF;i++)
      if ($i ~ ptn)
         print (i>1 ? $(i-1) OFS : "") $i (i<NF ? OFS $(i+1) : "")
}' file

This generates:

bīṁāṅñhikkhave, chandasampadā …pe…
īṁāṅñanuppādāya chandaṁ janetīṁāṅñi
chand1 chand2
chand1 chand2 chand3
chand2 chand3
ghi chand1 jkl
mno chand2 pdq

Expanding to handle a user-defined count of leading/trailing words:

awk -v ptn='chand' -v cnt1=1 -v cnt2=1 '
{ for (i=1;i<=NF;i++)
      if ($i ~ ptn) {
         sep=""
         for (j=i-cnt1;j<=i+cnt2;j++) {
             if (j<1 || j>NF) continue
             printf "%s%s", sep ,$j
             sep=OFS
         }
         print ""
      }
}' file

For cnt1=1 / cnt2=1 this generates:

bīṁāṅñhikkhave, chandasampadā …pe…
īṁāṅñanuppādāya chandaṁ janetīṁāṅñi
chand1 chand2
chand1 chand2 chand3
chand2 chand3
ghi chand1 jkl
mno chand2 pdq

For cnt1=2 / cnt2=2 this generates:

yathayidaṁ, bīṁāṅñhikkhave, chandasampadā …pe…
dhammānaṁ īṁāṅñanuppādāya chandaṁ janetīṁāṅñi vāyamati
chand1 chand2 chand3
chand1 chand2 chand3
chand1 chand2 chand3
def ghi chand1 jkl mno
jkl mno chand2 pdq

For cnt1=1 / cnt2=2 this generates:

bīṁāṅñhikkhave, chandasampadā …pe…
īṁāṅñanuppādāya chandaṁ janetīṁāṅñi vāyamati
chand1 chand2 chand3
chand1 chand2 chand3
chand2 chand3
ghi chand1 jkl mno
mno chand2 pdq
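
The comments under the question distinguish regexp matching from string matching. If the target should be treated as a literal string, one possible variation of the loop above swaps the `~` operator for awk's index(). This is only a sketch: the function name words_around_str and the environment variable name target are hypothetical, and the target is passed through ENVIRON rather than -v so that backslashes in it are not interpreted as escape sequences.

```shell
#!/usr/bin/env bash
# Hypothetical helper (names are assumptions, not from the answer):
# $1 = literal string to look for inside each word
# $2 = number of words to print before, $3 = number of words after
# Remaining arguments are files; with no files it reads stdin.
words_around_str() {
    local str=$1 before=$2 after=$3
    shift 3
    target=$str awk -v cnt1="$before" -v cnt2="$after" '
    BEGIN { str = ENVIRON["target"] }        # literal text, no escape processing
    { for (i=1;i<=NF;i++)
          if (index($i, str)) {              # substring test instead of regexp match
             sep=""
             for (j=i-cnt1;j<=i+cnt2;j++) {
                 if (j<1 || j>NF) continue   # skip positions outside the line
                 printf "%s%s", sep, $j
                 sep=OFS
             }
             print ""
          }
    }' "$@"
}

# Example call: "a.c" is matched literally, so the word "abc" is not a hit.
printf '%s\n' 'a.c abc a.c x' | words_around_str 'a.c' 1 1
```

With the regexp version, `a.c` would also match `abc` because `.` matches any character; the index() version only reports words that contain the exact text.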
