Split a file only at separator lines containing "|"

Question

I have a huge file structured like this:

>ABC_123|XX|YY|ID
CNHGYDGHA
>BBC_153|XX|YY|ID
ACGFDRER

I need to split this file based on the first value of each line that contains "|":

File1: ABC_123 -> should contain
          >ABC_123|XX|YY|ID
          CNHGYDGHA

File2: BBC_153 -> should contain
          >BBC_153|XX|YY|ID
          ACGFDRER

Check out `grep`, it's the right tool for extracting certain lines based on a pattern. — Ulrich Eckhardt, Mar 19 '18 at 09:43
Is there any possibility that `ABC_123` occurs again, later in the file? — Tom Fenech, Mar 19 '18 at 09:45
So what would be the desired output if more than one line started with `>ABC_123`? Please edit your question to show us. — Tom Fenech, Mar 19 '18 at 09:56

score 0 · Answer 1 · answered Mar 19 '18 at 09:49

This produces two files ABC_123 and BBC_153 from your input:

awk -F'|' 'NF > 1 { # when more than one field (i.e. line contains | )
    close(out)      # close the previous file (or do nothing, if none were open)
    out = $1        # assign first field to filename
    sub(/^>/, "", out) # remove the > from the start of the name
} 
{ print >> out }' file # print to the file, opening in append mode if needed

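If you are sure that each header name occurs only once in the file, you can use `>` instead of `>>`. The script closes the previous file at every header, and reopening a closed file with `>` truncates it, so a repeated header would overwrite the records written earlier.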
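To make the difference between the two redirections concrete, here is a minimal sketch. It assumes a hypothetical input `sample.txt` in which the header `ABC_123` occurs twice, runs in a throwaway directory from `mktemp -d`, and writes the `>` variant to files with a made-up `trunc_` prefix so both results can be compared side by side.

```shell
# Work in a scratch directory so no existing files are appended to.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Hypothetical sample input: the header ABC_123 appears twice.
printf '%s\n' '>ABC_123|XX|YY|ID' 'CNHGYDGHA' \
              '>BBC_153|XX|YY|ID' 'ACGFDRER' \
              '>ABC_123|XX|YY|ID' 'TTTTTTTT' > sample.txt

# Append mode: a file that is closed and later reopened keeps its contents.
awk -F'|' 'NF > 1 { close(out); out = $1; sub(/^>/, "", out) }
           { print >> out }' sample.txt

# Truncate mode: reopening a closed file with > empties it first.
awk -F'|' 'NF > 1 { close(out); out = $1; sub(/^>/, "", out); out = "trunc_" out }
           { print > out }' sample.txt

wc -l < ABC_123         # 4 lines: both ABC_123 records are kept
wc -l < trunc_ABC_123   # 2 lines: only the last ABC_123 record survives
```

One consequence of `>>` worth keeping in mind: it also appends to files left over from an earlier run, so rerunning the script in the same directory duplicates the output unless the old files are removed first.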

Understood, thanks for the detail. Will try. — bio-code's, Mar 19 '18 at 10:00

score 0 · Answer 2 · answered Mar 19 '18 at 09:53

awk approach:

awk -F'|' '/^>.+\|/{ fn = substr($1, 2) }{ print > fn }' file

Viewing the two files created from the sample input:

$ head [AB]BC_*
==> ABC_123 <==
>ABC_123|XX|YY|ID
CNHGYDGHA

==> BBC_153 <==
>BBC_153|XX|YY|ID
ACGFDRER

answered Mar 19 '18 at 09:53 by RomanPerekhrest

Useless to a beginner without an explanation, and depending on the number of headers in the OP's "huge" file, it may be necessary to close pipes. — Tom Fenech, Mar 19 '18 at 09:57
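Picking up both points from the comment above, here is a commented sketch of the same one-liner with a `close()` call added. It is not the answerer's code: the `close()` call, the switch to `>>` (needed once files can be closed and reopened), the guard for lines before the first header, and the `sample.txt` input built from the question's example are all assumptions made for illustration.

```shell
# Scratch directory and sample input taken from the question.
workdir=$(mktemp -d) && cd "$workdir" || exit 1
printf '%s\n' '>ABC_123|XX|YY|ID' 'CNHGYDGHA' \
              '>BBC_153|XX|YY|ID' 'ACGFDRER' > sample.txt

awk -F'|' '
/^>.+\|/ {                    # header line: starts with ">" and contains "|"
    if (fn != "") close(fn)   # release the previous output file
    fn = substr($1, 2)        # first field without the leading ">"
}
fn != "" { print >> fn }      # append each line to the current output file
' sample.txt

head ABC_123 BBC_153
```

Without `close()`, awk keeps every output file open until it exits, so an input with a very large number of headers can run into the limit on open files in some awk implementations. Closing each file when the next header arrives keeps at most one output file open at a time.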
