0

I need to edit a bash script that sorts .vcf files. vcf files are roughly structured as shown below:

## header line
## header line
…
Data line
Data line
…

The script is called vcfsort and is part of a library for manipulating vcf files. It looks like this:

head -1000 $1 | grep "^#"; cat $@ | grep -v "^#" | sort -k1,1d -k2,2n

And it is run by writing vcfsort input.vcf > output.vcf. I understand roughly what it does: since sorting should only be done on the data lines, it gets the header lines:

head -1000 $1 | grep "^#";

And combines it with sorted data lines:

cat $@ | grep -v "^#" | sort -k1,1d -k2,2n

I need the head command to read more lines. Instead of calling vcfsort like above, I thought I could just edit the script myself and write it out directly as a command like this:

head -10000 input.vcf | grep "^#"; cat input.vcf | grep -v "^#" | sort -k1,1d -k2,2n > output.vcf

This does not work as expected. My attempt above writes the correct output to stdout, if I leave out > output.vcf. However, if I include it, only the data lines are written to file and the header lines are written to stdout. So, I have a couple of questions:

  1. In this stack overflow answer, it is said that to combine semicolon-separated commands, they should be enclosed in parentheses. Why is that not the case in the vcfsort script?
  2. Why is $@ used in the cat command instead of $1? $@ should refer to all of a shell scripts arguments, but since only one is given (the input file), why not just use $1? If there is a reason for this, how can I transfer that to my command line expression?
  3. Why do I only get part of the stdout when I send it to a file?
  4. Could you show me the edits I need to make to get my command to work as intended?
John Kugelman
  • 349,597
  • 67
  • 533
  • 578
Stephen Miller
  • 503
  • 3
  • 11

2 Answers2

0

So the script gets first 1000 lines of first file! Separates header, and basically just copy all comments in those first 1000 lines to output.

Next, it filters all comments lines (leaving only data lines) for all files, and does sorting.

so if you use

vcfsort file1 file2 file3

$1 = "file1" and header from file1 only will be presented in output.

while $@ referring to all files: "file1 file2 file3"

if you need to get headers from all files and merge it - I would recommend to use loop.

for file in $@; do
    head -1000 $file | grep "^#"; 
done
cat $@ | grep -v "^#" | sort -k1,1d -k2,2n

Why do I only get part of the stdout when I send it to a file?

head -10000 input.vcf | grep "^#"; cat input.vcf | grep -v "^#" | sort -k1,1d -k2,2n > output.vcf

Each command executing separatelly (divided by semicolon ";"). So in example above you just redirecting data lines output after sorting. It doesn't redirect to file header part. I would recommend to delete redirecting to file and just use:

vcfsort input.vcf > output.vcf

This does not work as expected

May I know what was expected?

Alex Kapustin
  • 1,869
  • 12
  • 15
  • I think you misunderstand how the script is meant to be used. You only ever give it one input file. It is not meant for combining files. I just realized, though, that I actually almost provide the answer myself. I feel like an idiot: I need to enclose the head and cat command in parentheses. I just tried it and it works. Maybe the $@ is just sloppy coding then. I believe it would allow you to combine the data lines of several files with the headers lines of the first file, but that is not something you would want to do. – Stephen Miller Jan 28 '22 at 00:21
  • I'm unsure about sense of merging vcf file. I just read bash script and see that it was ment to use multiple files :) Otherwise, command for sorting would be different. something like that: grep -v "^#" $1 | sort -k1,1d -k2,2n – Alex Kapustin Jan 28 '22 at 00:31
  • Exactly my point. I believe it should have been $1 and not $@. I think it is sloppy coding, then. Thank you for discussing it, though. It helped me realize the error:-) – Stephen Miller Jan 28 '22 at 00:34
0

There are two command lists, separated by a ;, inside vcfsort:

  • head -1000 $1 | grep "^#"
  • cat $@ | grep -v "^#" | sort -k1,1d -k2,2n

Each list is a single pipeline. The final two commands in each pipeline inherit their standard output from vcfsort, so that when you run

vcfsort input.vcf > output.vcf

both grep and sort write to output.vcf.

The equivalent using braces would be (replacing ; with a newline for readability)

# Quoting the parameter expansions is important, to protect
# against word-splitting and pathname expansion of the original arguments.
{ head -1000 "$1" | grep "^#"
  cat "$@" | grep -v "^#" | sort -k1,1d -k2,2n
} > output.vcf

Output redirections apply only to a single command, not a command list. Here, a command group serves as that single command: the standard output of the command group is output.vcf, and the two lists in the group inherit that just as before.


Your attempt

head -10000 input.vcf | grep "^#"; cat input.vcf | grep -v "^#" | sort -k1,1d -k2,2n > output.vcf

only opened output.vcf to use as the standard output for sort; the standard output of grep remains whatever standard output it inherits from its parent, namely your terminal.

chepner
  • 497,756
  • 71
  • 530
  • 681