
Hi, I have a situation similar to "Grep group of lines", but slightly different.

I have a file in the format of:

>xxxx AB=AAA NNN xxxx CD=DDD xxxxx
xxx
xxx
xxx
xxx
xxx
>xxxx AB=AAA JJJ xxxx CD=EEE xxxxx
xxx
xxx
xxx
xxx
xxx
xxx
xxx
>xxxx AB=AAA NNN xxxx CD=FFF xxxxx
xxx
xxx
xxx
xxx
>xxxx AB=EEE FFF xxxx CD=GGG xxxxx
xxx
xxx
xxx
xxx
xxx
xxx

(Each item starting with > does not necessarily have the same number of xxx lines. Each xxx is a string of capital letters. The only cue that a record is complete is that the next line starts with >.)

First, I want to extract all items with AB=EEE FFF into a result file like the one below:

>xxxx AB=EEE FFF xxxx CD=GGG xxxxx
xxx
xxx
xxx
xxx
xxx
>xxxx AB=EEE FFF xxxx CD=TTT xxxxx
xxx
xxx
xxx
xxx
>xxxx AB=EEE FFF xxxx CD=EEE xxxxx
xxx
xxx
xxx
xxx
xxx
xxx

Second, I have a CSV file with a list of CD values, and I want to extract all items with CD=xxx, where xxx is a line in the CSV file.

A sample item is:

>sp|P01023|A2MG_HUMAN Alpha-2-macroglobulin OS=Homo sapiens OX=9606 GN=A2M PE=1 SV=3
MGKNKLLHPSLVLLLLVLLPTDASVSGKPQYMVLVPSLLHTETTEKGCVLLSYLNETVTV
SASLESVRGNRSLFTDLEAENDVLHCVAFAVPKSSSNEEVMFLTVQVKGPTQEFKKRTTV
MVKNEDSLVFVQTDKSIYKPGQTVKFR

AB in my example corresponds to OS here, and CD corresponds to GN (so it is a single string of capital letters and/or digits).

My CSV file looks like this (with ~1000 lines):

A2M
AIF1

Thanks a lot!

Joy Zheng
  • You could use `awk` and set the record separator to `>` . This makes awk treat your file as a set of multiline blocks. – user1934428 Jan 31 '23 at 07:34
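
A minimal sketch of the record-separator idea from the comment above, not a tested solution for the real data. The file names `in.fa` and `human.fa`, the variable `want`, and the three sample records are made up for illustration. It assumes `>` never appears anywhere except at the start of a header line, because `RS=">"` splits at every `>`.

```shell
# Hypothetical sample input; in.fa and human.fa are assumed file names.
cat > in.fa <<'EOF'
>sp|P1|A_HUMAN Protein A OS=Homo sapiens OX=9606 GN=A2M PE=1 SV=3
MGKNK
LLHPS
>sp|P2|B_MOUSE Protein B OS=Mus musculus OX=10090 GN=Aif1 PE=1 SV=1
SASLE
>sp|P3|C_HUMAN Protein C OS=Homo sapiens OX=9606 GN=AIF1 PE=1 SV=1
MVKNE
EOF

# RS=">" makes each block one record; FS="\n" makes $1 the header line.
# The empty record before the first ">" never matches, so it is skipped.
awk -v want='OS=Homo sapiens' '
    BEGIN { RS = ">"; FS = "\n" }
    index($1, want) { printf ">%s", $0 }   # put back the ">" that RS consumed
' in.fa > human.fa
```

`index` does a plain substring test, so `want` is matched literally rather than as a regular expression.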

1 Answer


Your question doesn't have much in the way of testable sample data but something like this might be a starting point:

awk -v s1='AB=EEE FFF' -v s2='CD' -v out='out.dat' '
    /^>/ {
        if ( ok = index($0,s1) )
            for ( i=1; i<=NF; i++ )
                if ( index($i, s2"=")==1 )
                    print substr( $i, index($i,"=")+1 )
    }
    ok { print >out }
' in.dat |\
grep -Fx -f - in.csv >out.csv

Use awk to process in.dat:

  • look for lines starting > and if found:
    • set flag based on presence/absence of desired string s1 (flag will remain set until re-tested at next > line)
    • if s1 present, search for a field that starts with string s2 followed by =
      • if found, write section after = to stdout
        • (for efficiency, one could break out of the for here)
  • if ok flag is set, copy the line to out.dat
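
The steps above can be tried on a small sample as follows. This is only a sketch: the three records are hypothetical, `names.txt` is an assumed file name, and `s1`/`s2` are set to the `OS`/`GN` keys from the question's real header format. It includes the `break` mentioned above and leaves out the `grep` stage.

```shell
# Hypothetical sample input in the header format shown in the question.
cat > in.dat <<'EOF'
>sp|P1|A_HUMAN Protein A OS=Homo sapiens OX=9606 GN=A2M PE=1 SV=3
MGKNK
LLHPS
>sp|P2|B_MOUSE Protein B OS=Mus musculus OX=10090 GN=Aif1 PE=1 SV=1
SASLE
>sp|P3|C_HUMAN Protein C OS=Homo sapiens OX=9606 GN=AIF1 PE=1 SV=1
MVKNE
EOF

awk -v s1='OS=Homo sapiens' -v s2='GN' -v out='out.dat' '
    /^>/ {
        ok = index($0, s1)                 # non-zero when the header contains s1
        if (ok)
            for (i = 1; i <= NF; i++)
                if (index($i, s2 "=") == 1) {
                    print substr($i, length(s2) + 2)   # text after "GN="
                    break                  # first matching field is enough
                }
    }
    ok { print > out }                     # copy lines of matching records
' in.dat > names.txt
```

After this runs, `out.dat` holds the two `Homo sapiens` records and `names.txt` holds their `GN` values, one per line.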

awk's stdout is piped into grep.

Use grep to find the lines of in.csv that exactly match one of the fixed strings in awk's output, and save the results to out.csv.
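
If the goal is instead to keep the whole records whose `GN` value appears in the CSV, as the second part of the question describes, one possible variant is sketched below. It is not part of the pipeline above: the sample data, the output name `by_gene.dat`, and the variable `key` are assumptions. It also assumes the CSV has one name per line and is not empty, since the `NR == FNR` idiom misfires on an empty first file.

```shell
# Hypothetical list of wanted names, one per line.
cat > in.csv <<'EOF'
A2M
AIF1
EOF

# Hypothetical sample records.
cat > in.dat <<'EOF'
>sp|P1|A_HUMAN Protein A OS=Homo sapiens OX=9606 GN=A2M PE=1 SV=3
MGKNK
LLHPS
>sp|P2|B_HUMAN Protein B OS=Homo sapiens OX=9606 GN=TP53 PE=1 SV=1
SASLE
>sp|P3|C_HUMAN Protein C OS=Homo sapiens OX=9606 GN=AIF1 PE=1 SV=1
MVKNE
EOF

awk -v key='GN' '
    # First file: remember every non-empty name (strip a trailing CR if any).
    NR == FNR { sub(/\r$/, ""); if ($0 != "") want[$0]; next }
    # Second file: on each header, decide whether the record is wanted.
    /^>/ {
        ok = 0
        for (i = 1; i <= NF; i++)
            if (index($i, key "=") == 1 && (substr($i, length(key) + 2) in want)) {
                ok = 1
                break
            }
    }
    ok                                     # print lines of wanted records
' in.csv in.dat > by_gene.dat
```

The lookup is an exact, case-sensitive match on the whole value, so `Aif1` would not match `AIF1`.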

jhnc