2

I have a very big Sequence Alignment Map (SAM) file as depicted below

 @X   YYYYYY ZZZZZ\
 @X   ssssss ddddd\
 @X   CCCCCC LLLLL
 
> FFFFFF    117 ch1 16448   0   *   =   16448   0   TCTTGCACTGATCTGATGGACAGCATTGATGACATAACACGGAGACTGTTGCTAAAAACATCCGATAAAACTCGTGCTCAGACACCAAATACTCAAGAAG    JJFEJDDDBDJJJHJDDDHDJJEFJDJJCDFDJEJCEHHFDDDJDJEHEEJFJJJHDIFJJJJJDJDDHHJCDDJJFJFJEJFEDJJJDH##########    MC:Z:55S22M23S  RG:Z:Sample_POP1    AS:i:0  XS:i:0
> FFFFFF    117 ch6 16448   0   *   =   16448   0   TCTTGCACTGATCTGATGGACAGCATTGATGACATAACACGGAGACTGTTGCTAAAAACATCCGATAAAACTCGTGCTCAGACACCAAATACTCAAGAAG    JJFEJDDDBDJJJHJDDDHDJJEFJDJJCDFDJEJCEHHFDDDJDJEHEEJFJJJHDIFJJJJJDJDDHHJCDDJJFJFJEJFEDJJJDH##########    MC:Z:55S22M23S  RG:Z:Sample_POP1    AS:i:0  XS:i:0
> FFFFFF    117 ch2 16448   0   *   =   16448   0   TCTTGCACTGATCTGATGGACAGCATTGATGACATAACACGGAGACTGTTGCTAAAAACATCCGATAAAACTCGTGCTCAGACACCAAATACTCAAGAAG    JJFEJDDDBDJJJHJDDDHDJJEFJDJJCDFDJEJCEHHFDDDJDJEHEEJFJJJHDIFJJJJJDJDDHHJCDDJJFJFJEJFEDJJJDH##########    MC:Z:55S22M23S  RG:Z:Sample_POP1    AS:i:0  XS:i:0
> FFFFFF    117 ch5 16448   0   *   =   16448   0   TCTTGCACTGATCTGATGGACAGCATTGATGACATAACACGGAGACTGTTGCTAAAAACATCCGATAAAACTCGTGCTCAGACACCAAATACTCAAGAAG    JJFEJDDDBDJJJHJDDDHDJJEFJDJJCDFDJEJCEHHFDDDJDJEHEEJFJJJHDIFJJJJJDJDDHHJCDDJJFJFJEJFEDJJJDH##########    MC:Z:55S22M23S  RG:Z:Sample_POP1    AS:i:0  XS:i:0
> FFFFFF    117 ch1 16448   0   *   =   16448   0   TCTTGCACTGATCTGATGGACAGCATTGATGACATAACACGGAGACTGTTGCTAAAAACATCCGATAAAACTCGTGCTCAGACACCAAATACTCAAGAAG    JJFEJDDDBDJJJHJDDDHDJJEFJDJJCDFDJEJCEHHFDDDJDJEHEEJFJJJHDIFJJJJJDJDDHHJCDDJJFJFJEJFEDJJJDH##########    MC:Z:55S22M23S  RG:Z:Sample_POP1    AS:i:0  XS:i:0

I want to split the file based on column 3 so I can do awk '{print > $3}' file.txt which is working fine. Now I want to keep the lines

 @X   YYYYYY ZZZZZ\
 @X   ssssss ddddd\
 @X   CCCCCC LLLLL

as header on top of all the splitted files, how can I do that?

I tried this:

awk '$1 ~ /^@/ {print > $3}'  file.txt
Timur Shtatland
  • 12,024
  • 2
  • 30
  • 47
Somu
  • 25
  • 3
  • 2
    Is there some reason why you don't want to use [samtools](http://www.htslib.org/)? E.g. `samtools view -h file.sam chr1 > chr1.sam` should provide the header lines and chr1 entries. – jared_mamrot Feb 16 '23 at 11:35
  • Yes that can also be used. – Somu Feb 16 '23 at 12:26
  • 1
    When you say `$3` in your question - do you mean the actual `$3` of `117` or did you really mean `$4` which contains those `ch` strings? – Ed Morton Feb 16 '23 at 12:44
  • 1
    Do not reinvent the wheel. For common bioinformatics tasks, use open-source tools that are specifically designed for these tasks, are well-tested, widely used, and handle edge cases. Here, use [`samtools`](https://github.com/samtools/samtools), which can be easily installed with [`conda`](https://conda.io/projects/conda/en/latest/user-guide/install/index.html), see the comment from *jared_mamrot*. Also, see: [How To Split A Bam File By Chromosome](https://www.biostars.org/p/46327/) and [split sorted bam file chromosome wise](https://www.biostars.org/p/405004/) – Timur Shtatland Feb 16 '23 at 15:16

2 Answers2

3

You have to keep track of whether the file is one you have seen before, and if not, write the header before you write to it for the first time.

awk '$1 ~ /^@/ { header = header $0 ORS; next }
   !seen[$3]++ { printf "%s", header >$3 }
   { print > $3 }' file.txt

The internal variable ORS usually contains a newline but it's customary to use the variable so that you only need to change the string in one place if you want to use a different output record separator.

This can run out of file handles if you have more than a couple of dozen distinct values in $3 but if your script otherwise works, I guess that's not a problem in your case.

(The brute-force fix is to close and reopen the file after each write, which makes the script much slower. A better fix if you have the memory is to collect all the results into RAM and only write when you have read all the data. A more sophisticated approach would keep a buffer of, say, 20 open file handles, and close the least recently used when you need to write to a file which isn't among them.)

tripleee
  • 175,061
  • 34
  • 275
  • 318
-1

If "header lines" always contains 3 fields then criteria can be as follows:

For lines containing more than 3 fields, set first and second field as ""; else, print line as it is:

file.txt used:

cat file.txt
@X   YYYYYY ZZZZZ\
@X   ssssss ddddd\
@X   CCCCCC LLLLL

> FFFFFF    117 ch1 16448   0   *   =   16448   0   TCTTGCACTGATCTGATGGACAGCATTGATGACATAACACGGAGACTGTTGCTAAAAACATCCGATAAAACTCGTGCTCAGACACCAAATACTCAAGAAG    JJFEJDDDBDJJJHJDDDHDJJEFJDJJCDFDJEJCEHHFDDDJDJEHEEJFJJJHDIFJJJJJDJDDHHJCDDJJFJFJEJFEDJJJDH##########    MC:Z:55S22M23S  RG:Z:Sample_POP1    AS:i:0  XS:i:0
> FFFFFF    117 ch6 16448   0   *   =   16448   0   TCTTGCACTGATCTGATGGACAGCATTGATGACATAACACGGAGACTGTTGCTAAAAACATCCGATAAAACTCGTGCTCAGACACCAAATACTCAAGAAG    JJFEJDDDBDJJJHJDDDHDJJEFJDJJCDFDJEJCEHHFDDDJDJEHEEJFJJJHDIFJJJJJDJDDHHJCDDJJFJFJEJFEDJJJDH##########    MC:Z:55S22M23S  RG:Z:Sample_POP1    AS:i:0  XS:i:0
> FFFFFF    117 ch2 16448   0   *   =   16448   0   TCTTGCACTGATCTGATGGACAGCATTGATGACATAACACGGAGACTGTTGCTAAAAACATCCGATAAAACTCGTGCTCAGACACCAAATACTCAAGAAG    JJFEJDDDBDJJJHJDDDHDJJEFJDJJCDFDJEJCEHHFDDDJDJEHEEJFJJJHDIFJJJJJDJDDHHJCDDJJFJFJEJFEDJJJDH##########    MC:Z:55S22M23S  RG:Z:Sample_POP1    AS:i:0  XS:i:0
> FFFFFF    117 ch5 16448   0   *   =   16448   0   TCTTGCACTGATCTGATGGACAGCATTGATGACATAACACGGAGACTGTTGCTAAAAACATCCGATAAAACTCGTGCTCAGACACCAAATACTCAAGAAG    JJFEJDDDBDJJJHJDDDHDJJEFJDJJCDFDJEJCEHHFDDDJDJEHEEJFJJJHDIFJJJJJDJDDHHJCDDJJFJFJEJFEDJJJDH##########    MC:Z:55S22M23S  RG:Z:Sample_POP1    AS:i:0  XS:i:0
> FFFFFF    117 ch1 16448   0   *   =   16448   0   TCTTGCACTGATCTGATGGACAGCATTGATGACATAACACGGAGACTGTTGCTAAAAACATCCGATAAAACTCGTGCTCAGACACCAAATACTCAAGAAG    JJFEJDDDBDJJJHJDDDHDJJEFJDJJCDFDJEJCEHHFDDDJDJEHEEJFJJJHDIFJJJJJDJDDHHJCDDJJFJFJEJFEDJJJDH##########    MC:Z:55S22M23S  RG:Z:Sample_POP1    AS:i:0  XS:i:0

awk:

awk '{ if( NF > 3) $1=$2=""; print }' file.txt
@X   YYYYYY ZZZZZ\
@X   ssssss ddddd\
@X   CCCCCC LLLLL

117 ch1 16448 0 * = 16448 0 TCTTGCACTGATCTGATGGACAGCATTGATGACATAACACGGAGACTGTTGCTAAAAACATCCGATAAAACTCGTGCTCAGACACCAAATACTCAAGAAG JJFEJDDDBDJJJHJDDDHDJJEFJDJJCDFDJEJCEHHFDDDJDJEHEEJFJJJHDIFJJJJJDJDDHHJCDDJJFJFJEJFEDJJJDH########## MC:Z:55S22M23S RG:Z:Sample_POP1 AS:i:0 XS:i:0
117 ch6 16448 0 * = 16448 0 TCTTGCACTGATCTGATGGACAGCATTGATGACATAACACGGAGACTGTTGCTAAAAACATCCGATAAAACTCGTGCTCAGACACCAAATACTCAAGAAG JJFEJDDDBDJJJHJDDDHDJJEFJDJJCDFDJEJCEHHFDDDJDJEHEEJFJJJHDIFJJJJJDJDDHHJCDDJJFJFJEJFEDJJJDH########## MC:Z:55S22M23S RG:Z:Sample_POP1 AS:i:0 XS:i:0
117 ch2 16448 0 * = 16448 0 TCTTGCACTGATCTGATGGACAGCATTGATGACATAACACGGAGACTGTTGCTAAAAACATCCGATAAAACTCGTGCTCAGACACCAAATACTCAAGAAG JJFEJDDDBDJJJHJDDDHDJJEFJDJJCDFDJEJCEHHFDDDJDJEHEEJFJJJHDIFJJJJJDJDDHHJCDDJJFJFJEJFEDJJJDH########## MC:Z:55S22M23S RG:Z:Sample_POP1 AS:i:0 XS:i:0
117 ch5 16448 0 * = 16448 0 TCTTGCACTGATCTGATGGACAGCATTGATGACATAACACGGAGACTGTTGCTAAAAACATCCGATAAAACTCGTGCTCAGACACCAAATACTCAAGAAG JJFEJDDDBDJJJHJDDDHDJJEFJDJJCDFDJEJCEHHFDDDJDJEHEEJFJJJHDIFJJJJJDJDDHHJCDDJJFJFJEJFEDJJJDH########## MC:Z:55S22M23S RG:Z:Sample_POP1 AS:i:0 XS:i:0
117 ch1 16448 0 * = 16448 0 TCTTGCACTGATCTGATGGACAGCATTGATGACATAACACGGAGACTGTTGCTAAAAACATCCGATAAAACTCGTGCTCAGACACCAAATACTCAAGAAG JJFEJDDDBDJJJHJDDDHDJJEFJDJJCDFDJEJCEHHFDDDJDJEHEEJFJJJHDIFJJJJJDJDDHHJCDDJJFJFJEJFEDJJJDH########## MC:Z:55S22M23S RG:Z:Sample_POP1 AS:i:0 XS:i:0