0

I have a file and would like to use grep to exclude a pattern. But I would also like to remove the 2 preceding lines for every match (that is excluded). How do I do this?

What I have tried:

cat file.txt
Sequence: MG719312_IGHV1-8*03_Homosapiens_F_V-REGION_127..422_296nt_1_____296+0=296___     from: 1   to: 296
  Start     End  Strand Pattern                 Mismatch Sequence
    217     225       + pattern:AA[CT]NNN[AT]CN        . aacacctcc
Sequence: M99648_IGHV2-26*01_Homosapiens_F_V-REGION_164..464_301nt_1_____301+0=301___     from: 1   to: 301
  Start     End  Strand Pattern                 Mismatch Sequence
    176     184       + pattern:AA[CT]NNN[AT]CN        . aatcctaca

# With grep -v I can remove the line with pattern

grep -v "[acgt]\{3\}cc[acgt][acgt]\{3\}" file.txt
Sequence: MG719312_IGHV1-8*03_Homosapiens_F_V-REGION_127..422_296nt_1_____296+0=296___ from: 1 to: 296
Start End Strand Pattern Mismatch Sequence
217 225 + pattern:AA[CT]NNN[AT]CN . aacacctcc
Sequence: M99648_IGHV2-26*01_Homosapiens_F_V-REGION_164..464_301nt_1_____301+0=301___ from: 1 to: 301
Start End Strand Pattern Mismatch Sequence

# But using -B 2 does not work here

grep -B 2 -v "[acgt]\{3\}cc[acgt][acgt]\{3\}" file.txt
Sequence: MG719312_IGHV1-8*03_Homosapiens_F_V-REGION_127..422_296nt_1_____296+0=296___ from: 1 to: 296
Start End Strand Pattern Mismatch Sequence
217 225 + pattern:AA[CT]NNN[AT]CN . aacacctcc
Sequence: M99648_IGHV2-26*01_Homosapiens_F_V-REGION_164..464_301nt_1_____301+0=301___ from: 1 to: 301
Start End Strand Pattern Mismatch Sequence

Any ideas how to remove the 2 preceding lines as well for every match?

benn
  • Possible duplicate of [How do I delete a matching line and the previous one?](https://stackoverflow.com/questions/7378692/how-do-i-delete-a-matching-line-and-the-previous-one) - the question has `-B 1` instead of `-B 2` but the answers will apply straightforwardly anyway. – tripleee Aug 08 '18 at 10:14
  • The example file has a clear record structure to it, and I'd be wary of trying to use line-oriented command line tools, like `grep` and `sed`, to hack something together. Looking at it, I'd be tempted to write a Perl script to parse the Sequence records apart and match on those. – Jon Aug 08 '18 at 10:19
  • @tripleee, thank you for directing me to the possible duplicate. The best answer in there works for one preceding line, but not for 2. – benn Aug 08 '18 at 10:40
  • @b.nota Try replacing "1d" with "2d", if you're referring to the `sed` answer. – confetti Aug 08 '18 at 10:41
  • not correct with 2d: `sed -n '/[acgt]\{3\}cc[acgt][acgt]\{3\}/{n;x;d;};x;2d;p;${x;p;}' file.txt Start End Strand Pattern Mismatch Sequence 217 225 + pattern:AA[CT]NNN[AT]CN . aacacctcc Sequence: M99648_IGHV2-26*01_Homosapiens_F_V-REGION_164..464_301nt_1_____301+0=301___ from: 1 to: 301` – benn Aug 08 '18 at 10:42
  • @b.nota are the groupings always 3 lines? if so, it would be easier to accumulate 3 lines at a time and then filter them based on condition – Sundeep Aug 08 '18 at 10:48
  • @Sundeep, yes they are always 3 lines – benn Aug 08 '18 at 10:49
  • The `tac` comment and the Awk answer don't look hard to adapt to your scenario. – tripleee Aug 08 '18 at 10:51
  • @tripleee, FYI tac command replacing +1d with +2d does not work. AWK is too hard to adapt (where change the 2 lines instead of 1??, please tell me if you know) – benn Aug 08 '18 at 11:00
  • Awk is vastly easier to adapt than sed. If you're doing `g/re/p` then use `grep`. If you're doing `s/regexp/backref-string/` then use `sed`. For anything else, just use awk for improved clarity, robustness, portability, performance, maintainability, etc., etc. – Ed Morton Aug 08 '18 at 13:18

3 Answers

2

Tested on GNU sed; syntax and features may vary with other implementations.

sed -E 'N;N; /[acgt]{3}cc[acgt][acgt]{3}/d' ip.txt
  • `-E` use ERE; some sed versions require `-r` instead of `-E`
  • `N;N` append two more lines to the pattern space
  • `/[acgt]{3}cc[acgt][acgt]{3}/d` delete if this condition matches
    • note that this would try to match the regex anywhere in the three lines; also, `[acgt][acgt]{3}` could be simplified to `[acgt]{4}`
    • `/\n.*\n.*[acgt]{3}cc[acgt][acgt]{3}/d` will restrict matching to the 3rd line only
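
To try the third-line-only form on the sample from the question, here is a minimal sketch. It assumes GNU sed; `sample_data` and `filter_records` are hypothetical helper names introduced only for this example.

```shell
# Hypothetical helper: print the two sample records from the question.
sample_data() {
    cat <<'EOF'
Sequence: MG719312_IGHV1-8*03_Homosapiens_F_V-REGION_127..422_296nt_1_____296+0=296___     from: 1   to: 296
  Start     End  Strand Pattern                 Mismatch Sequence
    217     225       + pattern:AA[CT]NNN[AT]CN        . aacacctcc
Sequence: M99648_IGHV2-26*01_Homosapiens_F_V-REGION_164..464_301nt_1_____301+0=301___     from: 1   to: 301
  Start     End  Strand Pattern                 Mismatch Sequence
    176     184       + pattern:AA[CT]NNN[AT]CN        . aatcctaca
EOF
}

# Hypothetical helper: drop each 3-line record whose third line matches.
filter_records() {
    # N;N joins three lines into the pattern space. The \n.*\n prefix
    # consumes both embedded newlines, so the rest of the regex can only
    # match inside the third line.
    sed -E 'N;N; /\n.*\n.*[acgt]{3}cc[acgt][acgt]{3}/d'
}

sample_data | filter_records
```

On this sample only the MG719312 record should remain, since `aatcctaca` matches the pattern and `aacacctcc` does not.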
Sundeep
  • Great solution, thank you. You are right about the `{4}` part, but for codons (in biology) this `{3}` is more logical. – benn Aug 08 '18 at 11:09
  • @b.nota just curious - have you considered how you'll adapt that solution when you need to skip, say, 10 lines instead of 3 or when you need to print the first of the lines or the last of them or print a count of blocks skipped or do just about anything else or test the regexp only in one location/field or you move to a platform that doesn't have GNU sed or....? Obviously I'm suggesting you shouldn't even be considering using sed for this - it's simply a job for awk. – Ed Morton Aug 08 '18 at 15:58
  • @EdMorton, I try to adapt from scripts/code that work for me, where I can understand the basics. I think a good explanation in an answer is essential. I hope you realize I am not working with `sed`, `awk`, and `grep` on a daily basis, but I like to use (and learn more about) these tools. – benn Aug 08 '18 at 16:40
  • Understood but just like in life it makes sense to use the right tool for the right job and learn to use them as such. Investing time into becoming the best you can be at using a screwdriver to cut down trees is completely doable but most people would advise against it and might suggest a saw. All I'm saying is learn which tools are right for which jobs and THEN learn how to use them each for what they do best, don't pick one tool that you know how to use for one task and assume you should use it for other tasks even if it is possible to do so. – Ed Morton Aug 08 '18 at 17:32
  • I'm definitely not suggesting you change your accepted answer btw, just trying to steer you on the right path going forward for how to think about UNIX tools and what the big 3 text processing tools of grep, sed, and awk, are best used for. – Ed Morton Aug 08 '18 at 17:45
2

All you need is:

tac file | awk '/regexp/{c=3} !(c&&c--)' | tac

Obviously set regexp to whatever regexp you want to match on and change 3 to however many lines you want to skip including the matching line. e.g. to skip every line containing 7 and the 4 lines before it:

$ seq 20 | tac | awk '/7/{c=5} !(c&&c--)' | tac
1
2
8
9
10
11
12
18
19
20

See https://stackoverflow.com/a/17914105/1745001 for how to print whatever lines you like around a matching line.

With your example:

$ tac file | awk '/[acgt]{3}cc[acgt][acgt]{3}/{c=3} !(c&&c--)' | tac
Sequence: MG719312_IGHV1-8*03_Homosapiens_F_V-REGION_127..422_296nt_1_____296+0=296___     from: 1   to: 296
  Start     End  Strand Pattern                 Mismatch Sequence
    217     225       + pattern:AA[CT]NNN[AT]CN        . aacacctcc
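
Since the comments confirm that every record is exactly 3 lines, a variant without `tac` is also possible. The following is only a sketch under that assumption; `sample_data` and `drop_matching_records` are hypothetical names, and the interval `{3}` is spelled out because some awk implementations (older mawk releases, for example) do not support interval expressions.

```shell
# Hypothetical helper: print the two sample records from the question.
sample_data() {
    cat <<'EOF'
Sequence: MG719312_IGHV1-8*03_Homosapiens_F_V-REGION_127..422_296nt_1_____296+0=296___     from: 1   to: 296
  Start     End  Strand Pattern                 Mismatch Sequence
    217     225       + pattern:AA[CT]NNN[AT]CN        . aacacctcc
Sequence: M99648_IGHV2-26*01_Homosapiens_F_V-REGION_164..464_301nt_1_____301+0=301___     from: 1   to: 301
  Start     End  Strand Pattern                 Mismatch Sequence
    176     184       + pattern:AA[CT]NNN[AT]CN        . aatcctaca
EOF
}

# Hypothetical helper: buffer each 3-line record and print it only if
# its third line does not match the pattern.
drop_matching_records() {
    awk '
        # start a new record on lines 1, 4, 7, ... or extend the current one
        { rec = (NR % 3 == 1 ? $0 : rec ORS $0) }
        # on every third line the record is complete: test that line only
        NR % 3 == 0 {
            if ($0 !~ /[acgt][acgt][acgt]cc[acgt][acgt][acgt][acgt]/) {
                print rec
            }
        }
    '
}

sample_data | drop_matching_records
```

Unlike the `tac` pipeline, this reads the input once in its original order and tests only the third line of each record, at the cost of hard-coding the record length.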

Now, something you might want to consider for your data:

$ cat tst.awk
++lineNr == 1 {
    delete fldNr2tag
    delete tagNr2tag
    delete tag2val
    numTags = 0

    for (i=1; i<=NF; i+=2) {
        sub(/:.*/,"",$i)
        tag = $i (i>1 ? "" : 1) # to distinguish the 2 "Sequence" tags
        val = $(i+1)
        tagNr2tag[++numTags] = tag
        tag2val[tag] = val
    }
}
lineNr == 2 {
    for (i=1; i<=NF; i++) {
        tag = $i
        fldNr2tag[i] = tag
    }
}
lineNr == 3 {
    for (i=1; i<=NF; i++) {
        tag = fldNr2tag[i]
        val = $i
        tagNr2tag[++numTags] = tag
        tag2val[tag] = val
    }

    prt()

    lineNr = 0
}

function prt(   tagNr, tag, val) {
    for (tagNr=1; tagNr<=numTags; tagNr++) {
        tag = tagNr2tag[tagNr]
        val = tag2val[tag]
        printf "tag2val[%s] = <%s>\n", tag, val
    }
    print "----"
}


$ awk -f tst.awk file
tag2val[Sequence1] = <MG719312_IGHV1-8*03_Homosapiens_F_V-REGION_127..422_296nt_1_____296+0=296___>
tag2val[from] = <1>
tag2val[to] = <296>
tag2val[Start] = <217>
tag2val[End] = <225>
tag2val[Strand] = <+>
tag2val[Pattern] = <pattern:AA[CT]NNN[AT]CN>
tag2val[Mismatch] = <.>
tag2val[Sequence] = <aacacctcc>
----
tag2val[Sequence1] = <M99648_IGHV2-26*01_Homosapiens_F_V-REGION_164..464_301nt_1_____301+0=301___>
tag2val[from] = <1>
tag2val[to] = <301>
tag2val[Start] = <176>
tag2val[End] = <184>
tag2val[Strand] = <+>
tag2val[Pattern] = <pattern:AA[CT]NNN[AT]CN>
tag2val[Mismatch] = <.>
tag2val[Sequence] = <aatcctaca>
----

Note that with the above you can access every value by its name, which removes any imprecision and/or false matches from comparisons or other calculations. You can also select specific fields to print in whatever order you like just by using the field name, e.g. `print tag2val["Sequence"], tag2val["Pattern"]`. So you can trivially convert your data to a CSV for import into Excel, or convert it to HTML or JSON, or do just about anything else with it.
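
As an illustration of access by name, here is a minimal sketch that filters on the `Sequence` column alone and writes a small CSV. It is a simplified variant of the script above, not a replacement for it: it assumes 3-line records and values free of commas and quotes, and the helper names `sample_data` and `records_to_csv` as well as the CSV column names are invented for this example.

```shell
# Hypothetical helper: print the two sample records from the question.
sample_data() {
    cat <<'EOF'
Sequence: MG719312_IGHV1-8*03_Homosapiens_F_V-REGION_127..422_296nt_1_____296+0=296___     from: 1   to: 296
  Start     End  Strand Pattern                 Mismatch Sequence
    217     225       + pattern:AA[CT]NNN[AT]CN        . aacacctcc
Sequence: M99648_IGHV2-26*01_Homosapiens_F_V-REGION_164..464_301nt_1_____301+0=301___     from: 1   to: 301
  Start     End  Strand Pattern                 Mismatch Sequence
    176     184       + pattern:AA[CT]NNN[AT]CN        . aatcctaca
EOF
}

# Hypothetical helper: emit one CSV row per record that does NOT match.
records_to_csv() {
    awk '
        BEGIN { OFS = ","; print "id", "start", "end", "match" }
        # line 1 of a record: reset the map and keep the sequence id
        NR % 3 == 1 { split("", tag2val); tag2val["Sequence1"] = $2 }
        # line 2 of a record: remember the column names by position
        NR % 3 == 2 { for (i = 1; i <= NF; i++) fldNr2tag[i] = $i }
        # line 3 of a record: map names to values, then filter by name
        NR % 3 == 0 {
            for (i = 1; i <= NF; i++) tag2val[fldNr2tag[i]] = $i
            if (tag2val["Sequence"] !~ /[acgt][acgt][acgt]cc[acgt][acgt][acgt][acgt]/) {
                print tag2val["Sequence1"], tag2val["Start"], tag2val["End"], tag2val["Sequence"]
            }
        }
    '
}

sample_data | records_to_csv
```

Because the regexp is compared against `tag2val["Sequence"]` only, text elsewhere in the record (the header line, the `Pattern` column) cannot cause a false match.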

Ed Morton
  • For this does not work: `tac file.txt | awk '/[acgt]\{3\}cc[acgt][acgt]\{3\}/{c=3} !(c&&c--)' | tac` – benn Aug 08 '18 at 11:16
  • Right, because `[acgt]\{3\}cc[acgt][acgt]\{3\}` is not a valid ERE. The `\{3\}` interval syntax belongs to BREs, as used by `grep` and `sed` when invoked without `-E`; awk works with POSIX EREs, where the interval is written `{3}`. I updated my answer to show it working with a valid regexp. – Ed Morton Aug 08 '18 at 11:18
  • Thanks for the help. – benn Aug 08 '18 at 11:24
  • You're welcome. When you have name-value pairs in your input, though, the best approach is usually to create an array that maps the names to their values first and then access the values by their names. So, I updated my answer to show you a script that does that. It's **really** how you should be working on your data so if you don't understand the benefits then please feel free to ask. – Ed Morton Aug 08 '18 at 12:03
1

Looking at the example file, it appears to have a record-oriented structure, so I'd be very wary of attempting to manipulate it using line-oriented tools such as grep and sed. As pointed out in the comments, there is already a similar question with a solution in sed, but the script isn't pretty and would be a nightmare to maintain or extend.

I'd be tempted to write a short Perl or Python script to parse the file into records and then work with the records. I don't know the details of the file format, but something like the following is probably a good start, and produces the output you want.

#!/usr/bin/perl -w

use strict;

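# Read the first line; the input must begin with a Sequence header.
my $line = <>;
unless (defined($line) && $line =~ /^Sequence/) {
    die "expected line to start with Sequence";
}
while (defined($line)) {
    # Collect the header and every following line up to the next header.
    my $record = $line;
    $line = <>;
    while (defined($line) && $line !~ /^Sequence/) {
        $record .= $line;
        $line = <>;
    }
    # Print the whole record unless the pattern occurs anywhere in it.
    print $record unless $record =~ /[acgt]{3}cc[acgt][acgt]{3}/;
}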
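
For readers who would rather stay in the shell, the same record-oriented idea can be sketched in awk. This is an illustrative equivalent under the same assumption (every record starts with a line beginning with `Sequence`), not a translation checked against the real file format; `sample_data`, `drop_records` and `emit` are hypothetical names.

```shell
# Hypothetical helper: print the two sample records from the question.
sample_data() {
    cat <<'EOF'
Sequence: MG719312_IGHV1-8*03_Homosapiens_F_V-REGION_127..422_296nt_1_____296+0=296___     from: 1   to: 296
  Start     End  Strand Pattern                 Mismatch Sequence
    217     225       + pattern:AA[CT]NNN[AT]CN        . aacacctcc
Sequence: M99648_IGHV2-26*01_Homosapiens_F_V-REGION_164..464_301nt_1_____301+0=301___     from: 1   to: 301
  Start     End  Strand Pattern                 Mismatch Sequence
    176     184       + pattern:AA[CT]NNN[AT]CN        . aatcctaca
EOF
}

# Hypothetical helper: group lines into records that start at /^Sequence/
# and print each record unless the pattern occurs anywhere in it.
drop_records() {
    awk '
        # print the buffered record if it does not match, then reset it
        function emit() {
            if (rec != "" && rec !~ /[acgt][acgt][acgt]cc[acgt][acgt][acgt][acgt]/) {
                print rec
            }
            rec = ""
        }
        /^Sequence/ { emit() }                      # a header closes the previous record
        { rec = (rec == "" ? $0 : rec ORS $0) }     # append the line to the current record
        END { emit() }                              # flush the last record
    '
}

sample_data | drop_records
```

Records may have any number of lines here, since the grouping depends on the `Sequence` header rather than on a fixed line count.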
Jon