I have a file with a certain structure, and I would like to delete the lines that do not follow it. Each record should consist of: 1) a line starting with the word "Sequence", 2) a line starting with the word "Start", 3) a line starting with a number.

In my file some records do not have the number line but do have the first two lines (the number line was removed with grep). I am looking for a way, with awk or sed, to remove the two preceding lines when there is no number line. Is this possible?

cat file.txt
Sequence: HM855457_IGHV1-8*02_Homosapiens_F_V-REGION_24..319_296nt_1_____296+0=296__rev-compl_     from: 1   to: 296
Start     End  Strand Pattern                 Mismatch Sequence
217     225       + pattern:AA[CT]NNN[AT]CN        . aacacctcc
Sequence: MG719312_IGHV1-8*03_Homosapiens_F_V-REGION_127..422_296nt_1_____296+0=296___     from: 1   to: 296
Start     End  Strand Pattern                 Mismatch Sequence
217     225       + pattern:AA[CT]NNN[AT]CN        . aacacctcc
Sequence: M99648_IGHV2-26*01_Homosapiens_F_V-REGION_164..464_301nt_1_____301+0=301___     from: 1   to: 301
Start     End  Strand Pattern                 Mismatch Sequence
Sequence: L21969_IGHV2-70*01_Homosapiens_F_V-REGION_144..444_301nt_1_____301+0=301___     from: 1   to: 301
Start     End  Strand Pattern                 Mismatch Sequence
176     184       + pattern:AA[CT]NNN[AT]CN        . aatactaca
Sequence: X92241_IGHV2-70*02_Homosapiens_F_V-REGION_144..433_290nt_1_____290+0=290_partialin3'__     from: 1   to: 290
Start     End  Strand Pattern                 Mismatch Sequence
176     184       + pattern:AA[CT]NNN[AT]CN        . aatactaca

Expected output:

cat file.txt
Sequence: HM855457_IGHV1-8*02_Homosapiens_F_V-REGION_24..319_296nt_1_____296+0=296__rev-compl_     from: 1   to: 296
Start     End  Strand Pattern                 Mismatch Sequence
217     225       + pattern:AA[CT]NNN[AT]CN        . aacacctcc
Sequence: MG719312_IGHV1-8*03_Homosapiens_F_V-REGION_127..422_296nt_1_____296+0=296___     from: 1   to: 296
Start     End  Strand Pattern                 Mismatch Sequence
217     225       + pattern:AA[CT]NNN[AT]CN        . aacacctcc
Sequence: L21969_IGHV2-70*01_Homosapiens_F_V-REGION_144..444_301nt_1_____301+0=301___     from: 1   to: 301
Start     End  Strand Pattern                 Mismatch Sequence
176     184       + pattern:AA[CT]NNN[AT]CN        . aatactaca
Sequence: X92241_IGHV2-70*02_Homosapiens_F_V-REGION_144..433_290nt_1_____290+0=290_partialin3'__     from: 1   to: 290
Start     End  Strand Pattern                 Mismatch Sequence
176     184       + pattern:AA[CT]NNN[AT]CN        . aatactaca
benn
  • Can you show expected output and your attempt at it? – anubhava Aug 08 '18 at 13:46
  • Sigh. It would be utterly trivial to tweak the awk script in [my previous answer](https://stackoverflow.com/a/51745195/1745001) to do this. This is what I was trying to warn you about with respect to using sed for tasks like this: with the tiniest change in requirements from your previous question, you now need a completely different solution. – Ed Morton Aug 08 '18 at 17:51

4 Answers

You may use this awk command:

awk '/^[0-9]+/ && NR==a["Sequence:"]+2 && NR==a["Start"]+1 {
   print r["Sequence:"] ORS r["Start"] ORS $0
}
/^(Sequence:|Start)/ {
   a[$1]=NR
   r[$1]=$0
}' file

Sequence: HM855457_IGHV1-8*02_Homosapiens_F_V-REGION_24..319_296nt_1_____296+0=296__rev-compl_     from: 1   to: 296
Start     End  Strand Pattern                 Mismatch Sequence
217     225       + pattern:AA[CT]NNN[AT]CN        . aacacctcc
Sequence: MG719312_IGHV1-8*03_Homosapiens_F_V-REGION_127..422_296nt_1_____296+0=296___     from: 1   to: 296
Start     End  Strand Pattern                 Mismatch Sequence
217     225       + pattern:AA[CT]NNN[AT]CN        . aacacctcc
Sequence: L21969_IGHV2-70*01_Homosapiens_F_V-REGION_144..444_301nt_1_____301+0=301___     from: 1   to: 301
Start     End  Strand Pattern                 Mismatch Sequence
176     184       + pattern:AA[CT]NNN[AT]CN        . aatactaca
Sequence: X92241_IGHV2-70*02_Homosapiens_F_V-REGION_144..433_290nt_1_____290+0=290_partialin3'__     from: 1   to: 290
Start     End  Strand Pattern                 Mismatch Sequence
176     184       + pattern:AA[CT]NNN[AT]CN        . aatactaca
anubhava
  • This is working very well, thanks. What manual/reference do you use for awk? I would like to use awk more, but it is hard to find a good reference. – benn Aug 08 '18 at 14:07
  • Start with http://www.grymoire.com/Unix/Awk.html and https://www.gnu.org/s/gawk/manual/gawk.pdf then you can check some `awk` answers on SO itself. – anubhava Aug 08 '18 at 14:29
  • @b.nota you'll find manual and learning resources in tag wikis... https://stackoverflow.com/tags/awk/info and https://stackoverflow.com/tags/sed/info – Sundeep Aug 08 '18 at 14:41

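% awk '
  $1 == "Sequence:" {seq   = $0}   # remember the latest "Sequence:" line
  $1 == "Start"     {start = $0}   # remember the latest "Start" line
  # print only when the two preceding lines were "Sequence:" and then "Start"
  $1 ~ /^[0-9]+$/ && l == "Start" && L == "Sequence:" {print seq; print start; print}
  {L = l}    # first field of the line two back
  {l = $1}   # first field of the previous line
  ' file.txt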
keithpjolley

For files that fit in memory, you can slurp the entire file and process it as a single string:

perl -0777 -pe 's/^Sequence.*\nStart.*\n(?!\d)//mg' ip.txt
  • -0777 slurps the entire file
  • m flag, so that the ^ and $ anchors match at line boundaries within the multi-line string
  • g flag, so that every incomplete record is removed, not only the first one
  • ^Sequence.*\nStart.*\n(?!\d) matches ^Sequence.*\nStart.*\n only if it is not followed by a digit. Note that . does not match a newline character unless the s flag is used

Alternatively, you could match and print only the correct grouping

perl -0777 -ne 'print /^Sequence.*\nStart.*\n\d.*\n/mg' ip.txt
Sundeep

To ONLY print the 3-line records all you need is:

$ cat tst.awk
/^Sequence:/ { lineNr=0; rec="" }
{ rec = (++lineNr > 1 ? rec ORS : "") $0 }
lineNr == 3 { print rec }

For example:

$ awk -f tst.awk file
Sequence: HM855457_IGHV1-8*02_Homosapiens_F_V-REGION_24..319_296nt_1_____296+0=296__rev-compl_     from: 1   to: 296
Start     End  Strand Pattern                 Mismatch Sequence
217     225       + pattern:AA[CT]NNN[AT]CN        . aacacctcc
Sequence: MG719312_IGHV1-8*03_Homosapiens_F_V-REGION_127..422_296nt_1_____296+0=296___     from: 1   to: 296
Start     End  Strand Pattern                 Mismatch Sequence
217     225       + pattern:AA[CT]NNN[AT]CN        . aacacctcc
Sequence: L21969_IGHV2-70*01_Homosapiens_F_V-REGION_144..444_301nt_1_____301+0=301___     from: 1   to: 301
Start     End  Strand Pattern                 Mismatch Sequence
176     184       + pattern:AA[CT]NNN[AT]CN        . aatactaca
Sequence: X92241_IGHV2-70*02_Homosapiens_F_V-REGION_144..433_290nt_1_____290+0=290_partialin3'__     from: 1   to: 290
Start     End  Strand Pattern                 Mismatch Sequence
176     184       + pattern:AA[CT]NNN[AT]CN        . aatactaca

But for a far more useful approach to analyzing your data, look again at the script at the bottom of my answer to your previous question. To tweak it to discard records that have fewer than 3 lines, all you need to do is move the lineNr=0 setting from inside the lineNr==3 block to a new /^Sequence:/ block, and the script will continue to give you an array whose fields you can access by name:

$ cat tst.awk
/^Sequence:/ { lineNr = 0 }

++lineNr == 1 {
    delete fldNr2tag
    delete tagNr2tag
    delete tag2val
    numTags = 0

    for (i=1; i<=NF; i+=2) {
        sub(/:.*/,"",$i)
        tag = $i (i>1 ? "" : 1) # to distinguish the 2 "Sequence" tags
        val = $(i+1)
        tagNr2tag[++numTags] = tag
        tag2val[tag] = val
    }
}
lineNr == 2 {
    for (i=1; i<=NF; i++) {
        tag = $i
        fldNr2tag[i] = tag
    }
}
lineNr == 3 {
    for (i=1; i<=NF; i++) {
        tag = fldNr2tag[i]
        val = $i
        tagNr2tag[++numTags] = tag
        tag2val[tag] = val
    }

    prt()
}

function prt(   tagNr, tag, val) {
    for (tagNr=1; tagNr<=numTags; tagNr++) {
        tag = tagNr2tag[tagNr]
        val = tag2val[tag]
        printf "tag2val[%s] = <%s>\n", tag, val
    }
    print "----"
}

$ awk -f tst.awk file
tag2val[Sequence1] = <HM855457_IGHV1-8*02_Homosapiens_F_V-REGION_24..319_296nt_1_____296+0=296__rev-compl_>
tag2val[from] = <1>
tag2val[to] = <296>
tag2val[Start] = <217>
tag2val[End] = <225>
tag2val[Strand] = <+>
tag2val[Pattern] = <pattern:AA[CT]NNN[AT]CN>
tag2val[Mismatch] = <.>
tag2val[Sequence] = <aacacctcc>
----
tag2val[Sequence1] = <MG719312_IGHV1-8*03_Homosapiens_F_V-REGION_127..422_296nt_1_____296+0=296___>
tag2val[from] = <1>
tag2val[to] = <296>
tag2val[Start] = <217>
tag2val[End] = <225>
tag2val[Strand] = <+>
tag2val[Pattern] = <pattern:AA[CT]NNN[AT]CN>
tag2val[Mismatch] = <.>
tag2val[Sequence] = <aacacctcc>
----
tag2val[Sequence1] = <L21969_IGHV2-70*01_Homosapiens_F_V-REGION_144..444_301nt_1_____301+0=301___>
tag2val[from] = <1>
tag2val[to] = <301>
tag2val[Start] = <176>
tag2val[End] = <184>
tag2val[Strand] = <+>
tag2val[Pattern] = <pattern:AA[CT]NNN[AT]CN>
tag2val[Mismatch] = <.>
tag2val[Sequence] = <aatactaca>
----
tag2val[Sequence1] = <X92241_IGHV2-70*02_Homosapiens_F_V-REGION_144..433_290nt_1_____290+0=290_partialin3'__>
tag2val[from] = <1>
tag2val[to] = <290>
tag2val[Start] = <176>
tag2val[End] = <184>
tag2val[Strand] = <+>
tag2val[Pattern] = <pattern:AA[CT]NNN[AT]CN>
tag2val[Mismatch] = <.>
tag2val[Sequence] = <aatactaca>
----

If all you wanted was to print the input lines as-is, it would be even more trivial, but I really think the above is what you will want going forward, so that you can add various combinations of comparisons and outputs.
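
To illustrate what such a comparison might look like, here is a rough sketch that keeps only complete records whose Start value is at least a given threshold. It is a cut-down variant of the name-to-value idea above, not the script itself: the function name filter_by_start, the threshold, the shortened sequence IDs and the output layout are all assumptions made for the example.

```shell
# Hypothetical sample: seqA and seqC are complete, seqB lacks its data line.
sample='Sequence: seqA     from: 1   to: 296
Start     End  Strand Pattern                 Mismatch Sequence
217     225       + pattern:AA[CT]NNN[AT]CN        . aacacctcc
Sequence: seqB     from: 1   to: 301
Start     End  Strand Pattern                 Mismatch Sequence
Sequence: seqC     from: 1   to: 301
Start     End  Strand Pattern                 Mismatch Sequence
176     184       + pattern:AA[CT]NNN[AT]CN        . aatactaca'

# Hypothetical helper: $1 is the minimum Start value; records come from stdin.
filter_by_start() {
    awk -v min="$1" '
        /^Sequence:/  { lineNr = 0 }     # every record restarts the counter
        ++lineNr == 1 { id = $2 }        # sequence ID from the first line
        lineNr == 2   {                  # header line: remember column names
            for (i = 1; i <= NF; i++) name[i] = $i
        }
        lineNr == 3   {                  # data line: map column name to value
            for (i = 1; i <= NF; i++) val[name[i]] = $i
            if (val["Start"] + 0 >= min + 0)
                print id, val["Start"], val["End"], val["Sequence"]
        }
    '
}

# Prints the one complete record whose Start is at least 200.
printf '%s\n' "$sample" | filter_by_start 200
```

Because the fields are looked up by header name rather than by position, a condition on End or Mismatch could be added in the same way.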

Ed Morton