I have a table of 9M rows, snp_collection.tab
with empty values in several columns:
snpID freq_bin ID_nearest_gene ID_nearest_gene_located_within HGNC_nearest_gene_snpsnap HGNC_nearest_gene_snpsnap_protein_coding flag_snp_within_gene flag_snp_within_gene_protein_coding
10:10001753 7 ENSG00000224788 CELF2 FALSE FALSE
10:10001794 41 ENSG00000224788 CELF2 FALSE FALSE
10:100023489 10 ENSG00000138131 ENSG00000138131 LOXL4 LOXL4 TRUE TRUE
10:100025128 45 ENSG00000138131 ENSG00000138131 LOXL4 LOXL4 TRUE TRUE
10:10002975 1 ENSG00000224788 CELF2 FALSE FALSE
Which for simplcity, looks like this:
This is messing up things when I later parse it using awk
to produce a more useful table for my purposes:
Later step which gets messed up from dirty input:
tail -n+2 broadSNPs.tab | gawk -F'[:\t]' -vOFS="\t" '{ print "chr"$1, ($2 - 1), $2, $3, $18, $19, $20, $21}' > broadSNPs.bed
Here, the empty spaces aren't respected by awk
, and rows with empty spaces are offset in the output.
So, I'm trying to fill in those empty spaces before awk
.
Using the advice given here, Filling space/tab separated, empty columns with 0, I've done:
sed 's/\t\t/\tNA\t/g' snp_collection.tab > broadSNPs.tab
I have Also tried replacing the \t
's with ctrl+v+tab
, but alas the output remains the same as i the original snp_collection.tab
, with the empty spaces still just empty.
What am I missing? Thanks!
Desired output:
snpID freq_bin ID_nearest_gene ID_nearest_gene_located_within HGNC_nearest_gene_snpsnap HGNC_nearest_gene_snpsnap_protein_coding flag_snp_within_gene flag_snp_within_gene_protein_coding
10:10001753 7 ENSG00000224788 NA NA CELF2 FALSE FALSE
10:10001794 41 ENSG00000224788 NA NA CELF2 FALSE FALSE
10:100023489 10 ENSG00000138131 ENSG00000138131 LOXL4 LOXL4 TRUE TRUE
10:100025128 45 ENSG00000138131 ENSG00000138131 LOXL4 LOXL4 TRUE TRUE
10:10002975 1 ENSG00000224788 NA NA CELF2 FALSE FALSE
With this fixed input, my awk
command shouldn't produce a nonsensical output.