
After profiling STR loci in a population, the output gave me 122 files, each of which contains about 800,000 unique loci. Here are two examples of my files:

SAMPLE  CHROM   POS Allele_1    Allele_2    LENGTH
HG02035 chr1 230769616 (tcta)14 (tcta)16 4
HG02035 chr2 1489653 (aatg)8 (aatg)11 4
HG02035 chr2 68011947 (tcta)11 (tcta)11 4
HG02035 chr2 218014855 (ggaa)16 (ggaa)16 4
HG02035 chr3 45540739 (tcta)15 (tcta)16 43
SAMPLE  CHROM   POS Allele_1    Allele_2    LENGTH
HG02040 chr1 230769616 (tcta)15 (tcta)15 4
HG02040 chr2 1489653 (aatg)8 (aatg)8 4
HG02040 chr2 68011947 (tcta)10 (tcta)10 4
HG02040 chr2 218014855 (ggaa)21 (ggaa)21 4
HG02040 chr3 45540739 (tcta)17 (tcta)17 4

I've been trying to extract the variants for each of the 800,000 STR loci. I expect the output to look like this for chromosome 1 at position 230769616:

HG02035 chr1 230769616 (tcta)14 (tcta)16 4
HG02040 chr1 230769616 (tcta)15 (tcta)15 4
HG02072 chr1 230769616 (tcta)10 (tcta)15 4
HG02121 chr1 230769616 (tcta)2 (tcta)2 4
HG02131 chr1 230769616 (tcta)16 (tcta)16 4
HG02513 chr1 230769616 (tcta)14 (tcta)14 4

I tried this command: awk '$1!="SAMPLE" {print $0 > $2"_"$3".locus.tsv"}' *.vcf

It worked, but it takes a lot of time to create such a large number of files, one per locus.

I am struggling to find an optimal solution for this.

3 Answers


You aren't closing the output files as you go, so if you have a large number of them your script will either slow down significantly trying to manage them all (e.g. with gawk) or fail saying "too many output files" (with most other awks).

Assuming you want a separate output file for every $2+$3 pair, you should be using the following with any awk:

tail -n +2 -q *.vcf | sort -k2,3 |
awk '
    { cur = $2 "_" $3 ".locus.tsv" }
    cur != out { close(out); out=cur }
    { print > out }
'
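As a quick illustration of the approach (the file names `s1.vcf`/`s2.vcf` are made up here, standing in for the 122 real files), the pipeline collects every sample's row for a locus into one file per CHROM/POS pair:

```shell
# Two toy inputs standing in for the 122 real files (names are hypothetical).
printf 'SAMPLE CHROM POS Allele_1 Allele_2 LENGTH\nHG02035 chr1 230769616 (tcta)14 (tcta)16 4\n' > s1.vcf
printf 'SAMPLE CHROM POS Allele_1 Allele_2 LENGTH\nHG02040 chr1 230769616 (tcta)15 (tcta)15 4\n' > s2.vcf

# Strip each header, sort so all rows for a locus are adjacent, then split;
# only one output file is open at a time because of the close().
tail -n +2 -q s1.vcf s2.vcf | sort -k2,3 |
awk '
    { cur = $2 "_" $3 ".locus.tsv" }
    cur != out { close(out); out = cur }
    { print > out }
'
# chr1_230769616.locus.tsv now holds both samples' rows for that locus.
```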

If you want the header line present in every output file, then tweak that to:

{ head -n 1 file1.vcf; tail -n +2 -q *.vcf | sort -k2,3; } |
awk '
    NR==1 { hdr=$0; next }
    { cur = $2 "_" $3 ".locus.tsv" }
    cur != out { close(out); out=cur; print hdr > out }
    { print > out }
'
Ed Morton
  • It should be `print >> out` instead of `print > out`, so the output file wouldn't be overwritten. – Bách Nguyễn Feb 02 '23 at 03:12
  • @BáchNguyễn no, it shouldn't and it wouldn't. This is awk, not shell - different tools with different languages and different semantics for constructs like `>` and `>>`. – Ed Morton Feb 02 '23 at 13:41
  • 2
    unless you mean you want file contents from past runs of the tool preserved - do you? – Ed Morton Feb 02 '23 at 13:49
  • yes, I just want to collect all the lines at that locus. I used `>` but I received only one line as a result. Then I changed it to `>>` and it gave me the expected results. Thanks a million btw!! – Bách Nguyễn Feb 05 '23 at 15:49
  • 1
    @BáchNguyễn then you're doing something wrong as that's just not the way awk works. Using `>` in awk is like writing `while read line; echo "$line"; done > output` or `> output; while read line; echo "$line" >> output; done` in shell, not `while read line; echo "$line" > output; done`, so it does NOT overwrite the output with each output line, it continually appends to the output. I suspect you're calling awk in a loop once per input file instead of just calling it once for all input files together as I show. You're welcome. – Ed Morton Feb 05 '23 at 16:47
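The behavior described in that last comment can be checked with a one-liner: awk's `>` truncates the file once, when it is first opened, then keeps appending for the rest of the run.

```shell
# awk's ">" truncates out.txt once, on first open, then appends for every
# later print in the same run - so out.txt ends up with all three lines.
printf 'a\nb\nc\n' | awk '{ print > "out.txt" }'
wc -l < out.txt    # 3, not 1
```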

My VCF file looks like this:

SAMPLE  CHROM   POS Allele_1    Allele_2    LENGTH
HG02526 chr15 17019727 (ata)4 (ata)4 3
HG02526 chr15 17035572 (tta)4 (tta)4 3
HG02526 chr15 17043558 (ata)4 (ata)4 3
HG02526 chr15 19822808 (ttta)3 (ttta)3 4
HG02526 chr15 19844660 (taca)3 (taca)3 4
  1. this is NOT a vcf file
  2. for such a file, sort on chrom/pos, compress with bgzip, index with tabix, and query with tabix: http://www.htslib.org/doc/tabix.html
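A hedged sketch of those steps, assuming htslib's `bgzip` and `tabix` are on PATH (the input file name and the queried region are made up; `-s/-b/-e` tell tabix which columns hold the sequence name and positions, `-S 1` skips the header line):

```shell
# Bail out politely if the htslib tools aren't installed (sketch only).
command -v bgzip >/dev/null && command -v tabix >/dev/null ||
    { echo "htslib tools not found; skipping" >&2; exit 0; }

# Toy tab-separated input standing in for one call set (name is hypothetical).
printf 'SAMPLE\tCHROM\tPOS\tAllele_1\tAllele_2\tLENGTH\n'  >  str_calls.tsv
printf 'HG02040\tchr1\t230769616\t(tcta)15\t(tcta)15\t4\n' >> str_calls.tsv
printf 'HG02035\tchr1\t230769616\t(tcta)14\t(tcta)16\t4\n' >> str_calls.tsv

# Sort on CHROM (field 2) then POS (field 3, numeric), keeping the header,
# then block-compress and index.
{ head -n 1 str_calls.tsv; tail -n +2 str_calls.tsv | sort -k2,2 -k3,3n; } |
    bgzip > str_calls.sorted.tsv.gz
tabix -S 1 -s 2 -b 3 -e 3 str_calls.sorted.tsv.gz

# Query one locus by region - this replaces the per-locus output files.
tabix str_calls.sorted.tsv.gz chr1:230769616-230769616
```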
Pierre
  • 1
    The advice is good but you should provide the code for sorting, compressing, indexing and retrieving. +1 nevertheless – Fravadona Jan 14 '23 at 20:14
  • @Fravadona you're right, there is a good example in the section 'example' of http://www.htslib.org/doc/tabix.html : `(grep "^#" in.gff; grep -v "^#" in.gff | sort -t"`printf '\t'`" -k1,1 -k4,4n) | bgzip > sorted.gff.gz; tabix -p gff sorted.gff.gz; tabix sorted.gff.gz chr1:10,000,000-20,000,000; ` – Pierre Jan 15 '23 at 14:40

You can try processing everything in memory before printing them.

FNR > 1 {                   # skip each file's header line
    i = $2 "_" $3           # locus key: CHROM_POS
    b[i, ++a[i]] = $0       # buffer the record; a[i] counts records per locus
}
END {
    for (i in a) {
        n = i ".locus.tsv"
        for (j = 1; j <= a[i]; ++j)
            print b[i, j] > n
        close(n)            # only one output file open at a time
    }
}

This may work depending on the size of your files and the amount of memory your machine has. Using another language that supports dynamic arrays as values can also be more efficient.
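For instance, saved under a made-up name like `per_locus.awk` and run once over all inputs (toy files below stand in for the 122 real ones), it writes one file per locus at the end:

```shell
# The in-memory program above, saved under a hypothetical name.
cat > per_locus.awk <<'EOF'
FNR > 1 {
    i = $2 "_" $3
    b[i, ++a[i]] = $0
}
END {
    for (i in a) {
        n = i ".locus.tsv"
        for (j = 1; j <= a[i]; ++j)
            print b[i, j] > n
        close(n)
    }
}
EOF

# Two toy inputs standing in for the 122 real files.
printf 'SAMPLE CHROM POS Allele_1 Allele_2 LENGTH\nHG02035 chr1 230769616 (tcta)14 (tcta)16 4\n' > s1.vcf
printf 'SAMPLE CHROM POS Allele_1 Allele_2 LENGTH\nHG02040 chr1 230769616 (tcta)15 (tcta)15 4\n' > s2.vcf

# One awk invocation for all files - no sort needed, everything is buffered.
awk -f per_locus.awk s1.vcf s2.vcf
# Produces chr1_230769616.locus.tsv with one line per sample.
```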

konsolebox