I have a FASTQ file (below) that I want to split by lane, which is field $2 when ":" is the field separator. My code splits the file correctly, but I also want the value of the shell variable $SM added to the output file names. Can someone please tell me what I am missing in my command?

SM="sample1"
awk 'BEGIN {FS = ":"} {lane=$2 ; print > "${SM}."lane".fastq" ; for (i = 1; i <= 3; i++) {getline ; print > "${SM}."lane".fastq"}}' < File.fastq

File.fastq

@HS2000-1015_160:7:1108:13370:100570/2
CTTGACTGCCAGAGACGCTCCTTTGCAATGCCTTCCGGTAACCAAATTTTTGGGCACAACACACAGCTGGCCTTCATTTCTTCAGGGGCTGGTAAACAGA
+
@@@ADFFFHHHFD=EF@:GHIIFHH<ECHGF@DDBB:6@D60?F=888)8='--(=5@EAE5?'(..((.;?@>>A>3;@####################
@HS2000-1015_160:5:2306:10070:71746/2
GAACCTCAAGGACTATTGGGAGAGCGGCGAGTGGGCCATCATCAAAGCCCCAGGCTACAAACACGACATCAAGTACAACTGCTGCGAGGAGATCTACCCC
+
@CCFFFFDHGHFHIJJJJJJGGGIIJJIGHI@FHGIIGHHEFGHHFFFFFBCDDDDDCDDDDDDDD;@BDCCDACDD@>ACCDDDDBDB<BA?C@CC@BD
@HS2000-1015_160:6:2116:4077:79041/2
GGTCCCCGCCTACGCCCACTGGGTTGGTGCACCTGGTGGTGGTGGCCGCCAAGAAGCTGGTGAACCGCCTCCAAGTGGCTCCCAAGACGCAGCTGGATGA
+
CCCFFFFFHHFHHJHJJJJJJJJGGHJHIGAGIIJIFHJ;@F;CHHFHFDDDDDCDDCDD9CCCDDBDDBBDDCDACDD8@BD3>?BCDBDDDACCDC@>
@HS2000-1015_160:5:2113:11446:94436/2
CGTCAGGGCCAACCCCGCCCCACCCTGACCCTACCTGGCACCCCTCACCTGTGGCCTGCCAGCACAGCCTCGCCCCTGCTGGCCAATGTGTCCCCCGTCA
+
?@@DA@DDFHH?DHI)<@@FHDBGGCHCBDH;DFA<)6.=7D;@CBCHD)).7@=>;?==AABC95<(5(5309@D########################
@HS2000-1015_160:6:2209:18284:44195/2
TAAAATGTCACAAAGCTGGAAACTCTTCCCTATCACAAACCAAAACTTAAAAGGACGTTACCTGGCTGGGTCTAAACTCCACATAACTCGCTTGCAGTTG
+
CCCFFFFEHHHGHJIIIJJIJJHIIJEHJJHIJJJIIJJIJIJJIJIIHJJIJGGHGHGIIHHIIIIHFH@DFFFDEEEECDDDCDDDDBDDBBDCDACC
@HS2000-1015_160:7:1215:18781:100685/2
ATAAAACAGTAAACAAAATAAAGTCAGTTTTTTTTTTTTTTTTTAAAGAACAAAATGAAACTTGAGGGAAAACTTCATGGAGTTACAGTTTATCCTGATA
+
CCCFFFFFHFHHFJJJJIIGIGI<CFHHIIJJJJJIJJHFDDD=ACC(38+9CB?:(>C(+:@>(4?05<?C?###########################
@HS2000-1015_160:6:1215:6292:43622/2
GGGTCCTGAGACCTGAGGGACCATTGGCCCTCTTCTGGCTTGCTTATCCTTTGTACCTGATGGCCAATGAATGTCAGAGATGGTCCTGTCTCCATCCAGT
+
BCCDFFFFHGHHHJJJIIJJJJJIJJJGIJIJJIHIIJJIEFHEIJJJJIGIGIIIIIJHFHIJJJJIHGHEC?BCEFFFEECCCEACCCCDDDDDDCCC
@HS2000-1015_160:7:2311:1291:4696/2
GATCTGGTGCTCGTATTCCATCCACCTCCCAAGCTATACATAATAACGGCCAAAGGACCTGGATGAAAGTGTCTGAAGCAGTTGTGTGTGTCTCACCTTC
+
?=?ABDDBCFDFHGGHBFCHHGD@GFDGCBDFGFFECCHHD@DDFHJEIIHGG3CE9C(7@E(.7=?;;@C?@ECA>@C3A(;A-5595<9:AC3@AC:A
@HS2000-1015_160:7:1205:18979:53766/2
TCTTGTTTTGACCAATAGTAAAGCACATTTCTCTAATTTGGATTTCTACAATATCCATATCTTGGTTTATGAAAGGTAGGGAAGAGACTTCAGGTACTGC
+
CCCFFDFFHHHHHIJIJJJIHIJHJJIJJJJIJIGIIIJJJJJHJIJJIJDHIJIIIIIJJJJIJGIJJJIIIGEEGCD@AHHFFEDFFCDDDDCCDD@C
@HS2000-1015_160:7:1205:5641:24287/2
ATAAGAAGGGAAGAATGATTAGGTGTCAAATGTTCTTTTTATTTTCTTTCAGTTCAATGCAAAAACTTTCCAGTGATTATGTAAATGCAGAATCATGTGG
+
CCCFFFFFHHHGHJIJJFJJGIGEHEHIJJJJGJGJJIJJJJJJJJJJJIJIIIJJJIJJIEHGIHGJJJJIGGGHIIIIEEEHCHHC>DFBEEA@CCCC
@HS2000-1015_160:7:1310:19879:73973/2
TTCTTGAGTTCTGATACCTGTTTCCACAATCGTTTCTGTTTCTGTTGTCTCCAGCCCATCCATGCTGTCCTCATCTTCCACTGCAGTTTTCACCCTACTT
+
@<@FFFDFHHH>FGGIJAEFHABHHIAGHAE=F@EF?FB@F:F<GGBGEHGGG9F=BGAGIIIHH;=.=CHG@CEHE3)7?=>)7@C>)(6(.6;A?ACC
@HS2000-1015_160:7:1215:4243:29984/2
ATCTACACCCAAAACAGAACTTTCACAAAAAAACTGTTGATACGAAGCTCATGAAAATCATGATGAATACTCCAACAATTAATGAATAAAACTATACAAT
+
;@@A;D;ADDFHFIIF3EG@A>ACEHE>EH=:DH@<9DB@F?B7C87'@)=)7@>@7==)7...).;?@C)6;((;(5;(>A:(:3;@3>:@>:@(4@::
@HS2000-1015_160:7:1314:6987:62989/2
ATAGCTGTCTGTTCAGAGTCTGATGTTTTCAGTAACACTCTTGATACATTAAGTGAGATAGAATGGAATCCAGCAACAAAGCTACTAAATCAGGTAACTT
+
C@CFFFFDHHHHHJIJJJBHHIIIIHJIJHGJIJJIEHGHJJIJJJJJJJJIGBGHHIJGHGIIHJJIJIIJIGIGHIGGGCHHHHBEFCCEFE>CCEEE
@HS2000-1015_160:6:1208:20370:97766/2
TTTACTTTTTCCCAAACAATAATGATGATAATGTGGCCATACTGGTGCATGAGGGCTCTTATTAAGGATAGGGGCCATGTCAGGCTCTATTGACTCCTAT
+
CCCFFFFFDHDFHJJJIJJJIIJGHJJJIIIIGHIJJIJJJIJIHIJJIIHGHIFHIFHJGIJJIJJJJJJJJHHHFFFFFEEEEEDDCDEDDDDDDCDD
@HS2000-1015_160:6:1108:20693:2521/2
CCCATTTTCTGATGAGGAAACAGGATCAGGGACATTGAGACCTACCAAAGTTACATAATACCAGTAGTAGAAATGGGACTTCAACACAGGCCTCTTGACT
+
7@@DDDDDHHHBDIGIB@F?A+AF@3+2AFE@1:BFE??HH6?BG9BD99??F49BC=88=:;F8=77/@EH=EHF9)=A>C>7?;(6@???C?>@####
@HS2000-1015_160:6:1206:11472:64908/2
AGTTTGTTGGACATTTGAGACCCCAGGAAATCCCCTTTCTCGTAACGTTCTCCGCTTGGATCTGATCTCAACAGGGTGTCGTAGTCATTCTTCAGCACAA
+
B@BDFFFFHHHHHIJGIIJIJJIJJJJGEGHHIJJJJJJIJIFFHIIHCHHIJJJGIIJH:CHHFFFFFFFEEEDD=@BDDDAB@DCDDDDDDD>CCB<?
@HS2000-1015_160:7:1114:4995:49287/2
CCTCCGCTCAGCACTGGCATTGGCATCGGTTTCTATGGCAACAGTGAGACCAGTGATGGGGTGTCCCAGCTCAGCTCTGCGCTGCTGCACGCCAACCACA
+
BCCFDFFFHHHHHJJJJJGHEIIJHIGIIFGHGIIIGHEHIIJJDHIJJJJJJEGIGGIDE:?BCEEAE@CCDCDDCDDDDDDDBCCDDD85?9BB@BDD
@HS2000-1015_160:7:1206:16723:26612/2
TTAGATATGCTGTATGTGAAGAAGAGGAGGTTAAAGAACACTGTTTTATGTAAATGTCTCATTCCTTATCCTACAGAAATTGCATTTTTAATTAAATCTT
+
BC@FFFFFHHHHHICIGGHEIGJJIJIEGHGHIJJGGIIIIJIFGIJJIIJIIIJJIIJJJJJIHHGJJGIIIIGIIIHIIFHGHFADFFFDFDE(;@CE
@HS2000-1015_160:5:2101:1745:52266/2
CCCCAGAATTCTCTTGTTTTTTCCTTGGTGATCCAGGAAAACGAAGCCCCCTCCTGTATTGACAGCTGGGAATTGTGGAGTCCACCGTCCTCCACCTGAG
+
C@CFFFFFHHHHHJIJJIJJJJJIIICHCEGIIIEHGIIHIJIGGGIJCHGIHHHGEFHHHGHEEFFDEDAC?CDDCDCD>95>:,,99@DCC?<AB9AC

Result file names I am getting:

${SM}.5.fastq
${SM}.6.fastq
${SM}.7.fastq

Result file names I want:

sample1.5.fastq 
sample1.6.fastq
sample1.7.fastq
MAPK
  • Please check if this link helps you: [How to use shell variables in awk](https://stackoverflow.com/questions/19075671/how-do-i-use-shell-variables-in-an-awk-script) – RavinderSingh13 Nov 07 '20 at 15:40
  • @RavinderSingh13 My code splits the files, but the file names are not what I wanted. They should look like the ones under "`Result file names I want:`" in the question above. – MAPK Nov 07 '20 at 15:43
  • How many lines do you want to write to each output file? Could you please clarify that part? – RavinderSingh13 Nov 07 '20 at 15:58
  • @RavinderSingh13 The output should be three `.fastq` files. I am just not able to have `${SM}` appended to my file names. The contents of the files are correctly given by the code above. I just need to have `$SM` appended to my output file names. – MAPK Nov 07 '20 at 16:00

2 Answers

EDIT: As per the OP's comment, here is an improved solution that also adds part of the read header ($1) to the output file name.

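SM="sample1"
awk -v sm="$SM" '
BEGIN{FS = ":"}
/^@HS/{                                # header line of a record
  split($1,arr,"_")                    # arr[1]="@HS2000-1015", arr[2]="160"
  sub(/^@[a-zA-Z]+/,"",arr[1])         # strip the leading "@HS", leaving "2000-1015"
  lane=$2
  close(outputFile)                    # close the previous output file
  outputFile=sm"."arr[1]"."lane".fastq"
}
{
  print >> (outputFile)                # append, because the file is reopened after close()
}' File.fastq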
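The follow-up comments ask for names of the form `sample1.HS2000-1015_160.5.fastq`, that is, the whole first field without its leading `@`. The following is only a sketch of that variant, not the answerer's code: it keeps the same `/^@HS/` header test, and the `workdir`, `demo.fastq`, and `prefix` names as well as the shortened reads are made up for the demonstration.

```shell
#!/bin/sh
# Hypothetical demo input: real headers from the question, shortened reads.
workdir=$(mktemp -d)
cat > "$workdir/demo.fastq" <<'EOF'
@HS2000-1015_160:7:1108:13370:100570/2
CTTGACTGCC
+
FFFHHHFDEF
@HS2000-1015_160:5:2306:10070:71746/2
GAACCTCAAG
+
CCFFFFDHGH
EOF

SM="sample1"
awk -v sm="$SM" -v dir="$workdir" -F: '
/^@HS/ {
  prefix = $1
  sub(/^@/, "", prefix)                 # "@HS2000-1015_160" -> "HS2000-1015_160"
  if (outputFile != "") close(outputFile)
  outputFile = dir "/" sm "." prefix "." $2 ".fastq"
}
{ print >> (outputFile) }               # append: the file is reopened after close()
' "$workdir/demo.fastq"

ls "$workdir"
```

Because the output is opened with `>>`, running the command a second time in the same directory would append to the existing files, so remove old output first.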


Fixing the OP's attempt: you can pass a shell variable into awk with -v awk_var_name="$shell_var" (see the link I shared in the comments). I have also fixed a few other things in your code.

SM="sample1"
awk -v sm="$SM" '
BEGIN{FS = ":"}
{
  close(outputFile)                    # close the previous output file
  lane=$2
  outputFile=sm"."lane".fastq"
  print >> (outputFile)                # >> because > would truncate the file each time it is reopened
  for (i = 1; i <= 3; i++){getline ; print >> (outputFile)}
}' File.fastq

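Fixes to the OP's attempt:

  • Created an outputFile variable that holds the output file name, for clarity.
  • Used close() to close each output file, so that we don't hit a "too many open files" error.
  • getline is generally discouraged, so the ideal version avoids it: instead of reading the next three lines with getline, it switches to a new output file whenever it sees a header line. (Each FASTQ record here is 4 lines long, so a line-number check such as FNR%4==1 would identify header lines as well.)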
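A minimal sketch of the line-number idea from the last point, assuming every record is exactly 4 lines long. It is not the answerer's code. The `workdir` and `demo.fastq` names and the shortened reads are invented for the demonstration; note that the second record's quality line starts with `@`, which a line-number test handles without any pattern matching.

```shell
#!/bin/sh
# Hypothetical demo input: real headers from the question, shortened reads.
workdir=$(mktemp -d)
cat > "$workdir/demo.fastq" <<'EOF'
@HS2000-1015_160:7:1108:13370:100570/2
CTTGACTGCC
+
@@@ADFFFHH
@HS2000-1015_160:5:2306:10070:71746/2
GAACCTCAAG
+
@CCFFFFDHG
@HS2000-1015_160:7:1215:18781:100685/2
ATAAAACAGT
+
CCCFFFFFHF
EOF

SM="sample1"
awk -v sm="$SM" -v dir="$workdir" -F: '
FNR % 4 == 1 {                          # lines 1, 5, 9, ... are record headers
  if (outputFile != "") close(outputFile)
  outputFile = dir "/" sm "." $2 ".fastq"
}
{ print >> (outputFile) }               # append: the file is reopened after close()
' "$workdir/demo.fastq"

wc -l "$workdir"/sample1.*.fastq
```

With this input, the lane 7 file receives two records (8 lines) and the lane 5 file one record (4 lines).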

Ideal way could be:

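SM="sample1"
awk -v sm="$SM" '
BEGIN{FS = ":"}
/^@HS/{                                # header line: pick the output file for this record
  lane=$2
  close(outputFile)                    # close the previous output file
  outputFile=sm"."lane".fastq"
}
{
  print >> (outputFile)                # append, because the file is reopened after close()
}' File.fastq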
RavinderSingh13
  • That worked. Thank you. Is it also possible to append lane=$1 to file name. i.e, `sample1.HS2000-1015_160.5.fastq`, `sample1.HS2000-1015_160.5.fastq`, and `sample1.HS2000-1015_160.5.fastq`? – MAPK Nov 07 '20 at 16:08
  • @MAPK, OK, I will keep these 2 solutions since they answer the base question, and I will add an edited solution too in a few minutes. – RavinderSingh13 Nov 07 '20 at 16:09
  • @MAPK, but your `$1` does NOT always contain `2000-1015`, so please explain the logic for adding it to the output file name. Is it a constant that needs to be added, or is it derived by some logic? – RavinderSingh13 Nov 07 '20 at 16:10
  • It is `@HS2000-1015_160`, I just need to add `HS2000-1015_160`. Not a constant though. – MAPK Nov 07 '20 at 16:13
  • @MAPK, could you please check my EDIT solution if that helps you(I haven't tested it though)? – RavinderSingh13 Nov 07 '20 at 16:16
  • @MAPK, could you please check it now and let me know if this helps you? – RavinderSingh13 Nov 07 '20 at 16:31
  • It works, but it also prints the file names on the terminal. Is there a way to suppress that? – MAPK Nov 07 '20 at 16:53
  • @MAPK, sorry, I had left that in to check the output file name :) Please check it now; we should be all good. Cheers :) – RavinderSingh13 Nov 07 '20 at 16:54
Your problem is that ${SM} is not expanded as a variable inside single quotes (').

This is working as designed.
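A small sketch of the difference, using only a `BEGIN` block so no input file is needed. The `literal` and `expanded` variable names are just for the demonstration.

```shell
#!/bin/sh
SM="sample1"

# Inside single quotes the shell leaves ${SM} alone, so awk sees it as plain text.
literal=$(awk 'BEGIN { print "${SM}.5.fastq" }')

# -v copies the shell value into the awk variable sm before the program runs.
expanded=$(awk -v sm="$SM" 'BEGIN { print sm ".5.fastq" }')

echo "$literal"    # prints ${SM}.5.fastq
echo "$expanded"   # prints sample1.5.fastq
```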

The quick and dirty solution is to replace ${SM} with '${SM}' in all places, like this:

SM="sample1"
awk 'BEGIN {FS = ":"} {lane=$2 ; print > "'${SM}'."lane".fastq" ; for (i = 1; i <= 3; i++) {getline ; print > "'${SM}'."lane".fastq"}}' < File.fastq

This way the shell expands ${SM} into the one-liner script before awk runs it.
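Splicing the value into the program text works for a simple value such as sample1, but a value containing spaces or quotes would alter the awk command itself. One alternative that leaves the single quotes intact, sketched here under the assumption of a POSIX awk, is to read the value from the environment through the `ENVIRON` array. The `name` variable is only for the demonstration.

```shell
#!/bin/sh
SM="sample1"

# Put SM into awk's environment for this one command and read it via ENVIRON.
name=$(SM="$SM" awk 'BEGIN { print ENVIRON["SM"] ".5.fastq" }')

echo "$name"   # prints sample1.5.fastq
```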

Another option: write your awk script to a file, then pass the field separator with the -F option and the input variable with the -v option, as below:

script.awk

{
  lane=$2 ; 
  print > SM"."lane".fastq" ; 
  for (i = 1; i <= 3; i++) {
    getline ; 
    print > SM"."lane".fastq";
  }
} 

run script.awk

SM="sample1"
awk -F":" -v SM="${SM}" -f script.awk File.fastq

improved script.awk

{
  outFile = SM"."$2".fastq";
  print > outFile ; 
  for (i = 1; i <= 3; i++) {
    getline; 
    print > outFile;
  }
} 
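To try the improved script end to end, the sketch below writes it to a temporary directory and runs it on a tiny input. The `workdir` and `demo.fastq` names and the shortened reads are assumptions made for the demonstration; the script body is the one shown above.

```shell
#!/bin/sh
workdir=$(mktemp -d)

# The improved script.awk from above, written out unchanged in behavior.
cat > "$workdir/script.awk" <<'EOF'
{
  outFile = SM"."$2".fastq";
  print > outFile ;
  for (i = 1; i <= 3; i++) {
    getline;
    print > outFile;
  }
}
EOF

# Hypothetical demo input: real headers from the question, shortened reads.
cat > "$workdir/demo.fastq" <<'EOF'
@HS2000-1015_160:7:1108:13370:100570/2
CTTGACTGCC
+
FFFHHHFDEF
@HS2000-1015_160:6:2116:4077:79041/2
GGTCCCCGCC
+
CCCFFFFFHH
@HS2000-1015_160:7:1215:18781:100685/2
ATAAAACAGT
+
CCCFFFFFHF
EOF

SM="sample1"
# Run inside workdir because the script writes to relative file names.
(cd "$workdir" && awk -F":" -v SM="$SM" -f script.awk demo.fastq)

ls "$workdir"
```

The script never calls close(), so each file stays open and `>` truncates it only once; that is fine for a handful of lanes but could run into the open-file limit with many distinct output names.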
Dudi Boy