I need to edit a bash script that sorts .vcf files. vcf files are roughly structured as shown below:
## header line
## header line
…
Data line
Data line
…
The script is called vcfsort and is part of a library for manipulating vcf files. It looks like this:
head -1000 $1 | grep "^#"; cat $@ | grep -v "^#" | sort -k1,1d -k2,2n
It is run like this:
vcfsort input.vcf > output.vcf
I understand roughly what it does: since sorting should only be done on the data lines, it gets the header lines:
head -1000 $1 | grep "^#";
and combines them with the sorted data lines:
cat $@ | grep -v "^#" | sort -k1,1d -k2,2n
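To check my understanding, here is a small sketch of those two steps run on a made-up file. The name toy.vcf and its contents are invented for illustration and are not real VCF data:

```shell
# Build a tiny stand-in for a VCF file (made-up, tab-separated contents)
printf '## header line\n## header line\nchr2\t300\nchr1\t200\nchr1\t50\n' > toy.vcf

# Step 1: print the header lines in their original order
head -1000 toy.vcf | grep "^#"

# Step 2: print the data lines, sorted by column 1 (dictionary order)
# and then by column 2 (numeric order)
cat toy.vcf | grep -v "^#" | sort -k1,1d -k2,2n
```

As far as I can tell, this prints the two header lines first, followed by the data lines in the order chr1 50, chr1 200, chr2 300.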
I need the head command to read more lines. Instead of calling vcfsort like above, I thought I could just edit the script myself and write it out directly as a command like this:
head -10000 input.vcf | grep "^#"; cat input.vcf | grep -v "^#" | sort -k1,1d -k2,2n > output.vcf
This does not work as expected. If I leave out > output.vcf, my attempt writes the correct output to stdout. However, if I include it, only the data lines are written to the file, and the header lines are still written to stdout. So, I have a few questions:
- In this Stack Overflow answer, it is said that to combine semicolon-separated commands, they should be enclosed in parentheses. Why is that not the case in the vcfsort script?
- Why is $@ used in the cat command instead of $1? $@ should refer to all of a shell script's arguments, but since only one is given (the input file), why not just use $1? If there is a reason for this, how can I transfer that to my command line expression?
- Why do I only get part of the stdout when I send it to a file?
- Could you show me the edits I need to make to get my command to work as intended?
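In case it helps, the redirection behaviour can be reproduced without any VCF file. This is a minimal sketch that uses echo in place of the real commands; out.txt is a made-up file name:

```shell
# Two commands separated by a semicolon, with a redirect at the very end
echo "## header line"; echo "data line" > out.txt

# Show what actually ended up in the file
cat out.txt
```

Here the header line appears on the terminal, and out.txt contains only the data line, which matches what I see with my real command.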