3

I am trying to combine dynamic regular expressions with awk's ability to print the lines between two patterns, where either pattern may come from a bash variable. In this specific case, the first pattern is a bash variable and the second is the next line that begins with ">". The data looks something like:

CGCGCGCGCGCGCGCGCGCGCGCG
>jcf719000004955    0-783586
ACGTACGTACGTACGTACGTACGT
ACGTACGTACGTACGTACGTACGT
ACGTACGTACGTACGTACGTACGT
>jcf_anything   0-999999
TATATATATATATATATATATATA
TATATATATATATATATATATATA

And I would like to obtain just:

ACGTACGTACGTACGTACGTACGT
ACGTACGTACGTACGTACGTACGT
ACGTACGTACGTACGTACGTACGT

So, using these variables:

i="jcf719000004955"
data="/bin/file"

Neither of these commands works:

awk '/^\>$i/{f=1;next} /^\>.*/{f=0} f' $data
awk '/^\>$i/{f=0} f; /^\>.*/{f=1}' $data

I'm able to use dynamic regular expressions to get the matching pattern containing my bash variable as such:

awk -v var="$i" '$0 ~ var ' $data | head -1
>jcf719000004955    0-783586

But how do I combine the use of dynamic regular expressions in order to obtain the lines in between two variables/patterns?

anita

3 Answers

2

You can use the following gawk command:

i=jcf719000004955; awk -v var="$i" '$0~"^>"var{f=1; next}/^[^>]/{if(f)print;next}/^>/{if(f)exit}' input.txt

input:

CGCGCGCGCGCGCGCGCGCGCGCG
>jcf719000004955    0-783586
ACGTACGTACGTACGTACGTACGT
ACGTACGTACGTACGTACGTACGT
ACGTACGTACGTACGTACGTACGT
>jcf_anything   0-999999
TATATATATATATATATATATATA
TATATATATATATATATATATATA 

output:

ACGTACGTACGTACGTACGTACGT
ACGTACGTACGTACGTACGTACGT
ACGTACGTACGTACGTACGTACGT

explanations:

  • -v var="$i" passes the shell variable to awk, so that the script can access its value as var.
  • An uninitialized awk variable evaluates to 0 (or the empty string), so f is false until the script sets it.
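
To make the first bullet concrete, here is a minimal sketch of why the attempts in the question print nothing and what -v changes. The temporary file and the names wrong and right are made up for this illustration; the sample lines are the ones from the question.

```shell
#!/bin/sh
# Hypothetical sample file holding the data from the question.
data=$(mktemp)
cat > "$data" <<'EOF'
CGCGCGCGCGCGCGCGCGCGCGCG
>jcf719000004955    0-783586
ACGTACGTACGTACGTACGTACGT
ACGTACGTACGTACGTACGTACGT
ACGTACGTACGTACGTACGTACGT
>jcf_anything   0-999999
TATATATATATATATATATATATA
TATATATATATATATATATATATA
EOF

i="jcf719000004955"

# Inside single quotes the shell does not expand $i, so the regex that awk
# receives contains the characters "$i" instead of the ID. No line matches.
wrong=$(awk '/^>$i/' "$data")

# With -v the shell expands "$i" first and awk gets the ID as the awk
# variable var, which can then be glued onto "^>" to build a dynamic regex.
right=$(awk -v var="$i" '$0 ~ ("^>" var)' "$data")

printf 'wrong: [%s]\n' "$wrong"
printf 'right: [%s]\n' "$right"
rm -f "$data"
```

The parentheses around "^>" var are optional, since concatenation binds tighter than ~, but they make the intent easier to read.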

the awk script:

# Rule(s)

$0 ~ ("^>"var) { # when the line starts with > followed by the value of your shell variable
        f = 1 # set f to 1
        next  # go to the next line
}

/^[^>]/ { # when the line is non-empty and does not start with >
        if (f) { # check whether f is set
                print $0 # if so, print the whole line to stdout
        }
        next # jump to the next line
}

/^>/ { # if we reach this point, the line starts with > but is not the header stored in your variable
 if(f) { # if f is 1, the wanted block has already been printed, so we can exit
       exit
 }
}
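
One caveat worth knowing: in $0 ~ "^>"var the value is used as a regular expression and is only anchored at the start, so a shorter ID such as jcf7190 would also select the >jcf719000004955 block. If the first whitespace-separated field of the header is always the complete ID (an assumption based on the sample data, not something the question states), a plain string comparison is one possible alternative. A rough sketch, with a condensed version of the flag logic and made-up variable names:

```shell
#!/bin/sh
# Hypothetical sample file holding the data from the question.
data=$(mktemp)
cat > "$data" <<'EOF'
CGCGCGCGCGCGCGCGCGCGCGCG
>jcf719000004955    0-783586
ACGTACGTACGTACGTACGTACGT
ACGTACGTACGTACGTACGTACGT
ACGTACGTACGTACGTACGTACGT
>jcf_anything   0-999999
TATATATATATATATATATATATA
TATATATATATATATATATATATA
EOF

# Same flag logic in both programs; only the header test differs.
regex_prog='$0 ~ ("^>" var) {f = 1; next} /^>/ {if (f) exit} f'
exact_prog='$1 == (">" var) {f = 1; next} /^>/ {if (f) exit} f'

# A deliberately truncated ID: the regex form still selects the block ...
regex_prefix=$(awk -v var="jcf7190" "$regex_prog" "$data")
# ... while the string comparison on the first field selects nothing.
exact_prefix=$(awk -v var="jcf7190" "$exact_prog" "$data")
# With the full ID the string comparison prints the block.
exact_full=$(awk -v var="jcf719000004955" "$exact_prog" "$data")

printf 'regex, truncated ID:\n%s\n' "$regex_prefix"
printf 'exact, truncated ID:\n%s\n' "$exact_prefix"
printf 'exact, full ID:\n%s\n' "$exact_full"
rm -f "$data"
```

Whether the prefix behaviour is a bug or a feature depends on your IDs; the regex form is the one to keep if you want to pass patterns rather than literal names.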


Allan
  • Thank you! I got this to work (although I'm not sure if my intent is the most efficient solution to my problem). Explanations were pretty helpful. n00bie question: Why does `/^>/` work for `>_anything` in place of `/^>.*/`? (Why don't I need the wildcard?) – anita Jan 22 '18 at 09:46
  • @anita **think** about what `.*` **means**. Say the words to yourself. – Ed Morton Jan 22 '18 at 13:58
  • Is this actually gawk only, out of interest? I’m not so good on which bits are portable. – Guy Jan 23 '18 at 01:35
  • @Guy: I am not sure whether it works with other awks, since I only use gawk ;-) but since these are very standard operations, I suppose it should. – Allan Jan 23 '18 at 01:42
  • @anita: it is not required to add `.*`, since we are already matching the lines that start with `>`. You could add it if you want; it doesn't really matter. Last but not least, if my answer solved your issue, you can mark it as the **correct answer** by clicking the check mark on its left. :-) – Allan Jan 23 '18 at 01:45
1

You can try this one too

awk -F'\n' -v RS='>' -v i="$i" '$1 ~ i {for(j=2;j<NF;j++) print $j}' infile
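
Since the one-liner comes without explanation, here is my reading of it, spread over several lines with comments. This is a sketch run against the sample data from the question; the temporary file and the variable out are only there for the illustration.

```shell
#!/bin/sh
# Hypothetical sample file holding the data from the question.
data=$(mktemp)
cat > "$data" <<'EOF'
CGCGCGCGCGCGCGCGCGCGCGCG
>jcf719000004955    0-783586
ACGTACGTACGTACGTACGTACGT
ACGTACGTACGTACGTACGTACGT
ACGTACGTACGTACGTACGTACGT
>jcf_anything   0-999999
TATATATATATATATATATATATA
TATATATATATATATATATATATA
EOF

i="jcf719000004955"

out=$(awk -F'\n' -v RS='>' -v i="$i" '
# RS=">" makes every ">" end a record, so one record is a header line
# plus the sequence lines below it. -F"\n" turns each line into a field.
$1 ~ i {                      # $1 is the header line; test it against i
    for (j = 2; j < NF; j++)  # $2 .. $(NF-1) are the sequence lines
        print $j              # $NF is the empty text after the last newline
}' "$data")

printf '%s\n' "$out"
rm -f "$data"
```

Note that $1 ~ i is not anchored, so the ID may match anywhere in the header, and that the approach assumes ">" never occurs inside a header or sequence line.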
ctac_
1

The following awk command could help as well.

i="jcf719000004955"
data="/bin/file"
awk -v val="$i" '/^>/{match($0,val);if(substr($0,RSTART,RLENGTH)){flag=1} else {flag=""};next} flag' "$data"

Output will be as follows.

ACGTACGTACGTACGTACGTACGT
ACGTACGTACGTACGTACGTACGT
ACGTACGTACGTACGTACGTACGT

Explanation of the above code:

i="jcf719000004955"              ## Set the shell variable i to the value the OP mentioned.
data="your_file"                 ## Set the shell variable data to the input file that awk will read.
awk -v val="$i" '                ## Pass the value of the shell variable i to awk as the awk variable val, using the -v option.
/^>/{                            ## If a line starts with >, do the following:
  match($0,val);                 ## Use the match function of awk to look for val (as a regex) in the current line. On a match, it sets RSTART to the index where the match begins and RLENGTH to the length of the matched text.
  if(substr($0,RSTART,RLENGTH)){ ## If the substring of length RLENGTH starting at RSTART is not empty, do the following:
    flag=1 }                     ## Set the variable flag to true.
  else{                          ## If the substring is empty, do the following:
    flag=""};                    ## Set the variable flag to the empty string (false).
next                             ## next is an awk keyword that skips all remaining statements for the current line.
}
flag                             ## If flag is true, the current line is printed, because print is the default action when none is given.
' "$data"                        ## Quote "$data", which holds the name of the input file that awk will read.
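
As far as I can tell, match() and substr() are only used here as a yes/no test, so the same idea can be sketched more briefly by assigning the result of the regex test to flag directly. This is an assumed simplification, not the code from the answer; it should behave the same for any non-empty val. The temporary file and the variable out are made up for the illustration.

```shell
#!/bin/sh
# Hypothetical sample file holding the data from the question.
data=$(mktemp)
cat > "$data" <<'EOF'
CGCGCGCGCGCGCGCGCGCGCGCG
>jcf719000004955    0-783586
ACGTACGTACGTACGTACGTACGT
ACGTACGTACGTACGTACGTACGT
ACGTACGTACGTACGTACGTACGT
>jcf_anything   0-999999
TATATATATATATATATATATATA
TATATATATATATATATATATATA
EOF

i="jcf719000004955"

# On every header line, flag becomes 1 if the header matches val and 0
# otherwise; the header itself is skipped. All other lines are printed
# only while flag is 1.
out=$(awk -v val="$i" '/^>/ {flag = ($0 ~ val); next} flag' "$data")

printf '%s\n' "$out"
rm -f "$data"
```

Unlike a version that exits at the next header, this one keeps reading, so it prints every block whose header matches val.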
RavinderSingh13