I have been modifying this script and it runs, allowing me to produce the desired output. However, I also get unwanted output as well. For example, the script seems to take the ends of files in the current directory (.txt, .bcf, .bam) and combines them into files ending .txt.bam or .bam.bcf. So the problem is that it creates a large number of excess files which are not needed.
#!/bin/bash
set - eu
genome=/storage1/ref_genome/GRCh38.dxv.fa
export genome
function parallel_call {
bcftools mpileup \
--fasta-ref ${genome} \
--regions $2 \
--output-type u \
$1 | \
bcftools call --multiallelic-caller \
--variants-only \
--output-type u - > ${1/.bam/}.$2.bcf
}
export -f parallel_call
function bam_chromosomes {
samtools idxstats $1 | cut -f 1 | grep -w -f search_file.txt
}
export -f bam_chromosomes
chrom_set=`bam_chromosomes test_file.bam`
parallel --verbose -j 90% parallel_call test_file.bam ::: ${chrom_set}
The ideal output would be:
test_file.chr1.bcf
test_file.chr2.bcf
test_file.chr3.bcf
...
test_file.chr22.bcf
I have tried a few things such as specifying the exact patterns I want grep to pick up. But I have no idea why this is happening - unless my use of parallel is incorrect?
I would upload the files however they are very large and my internet is really slow currently. Any ideas would be great.
CRCh38.dxv.fa = human genome reference
search_file.txt = file containing patterns for grep to search. In this case it contains (chr1, chr2, chr3 ... chr22)
test_file.bam = a bam file used for testing