-1

I have a table of 9M rows, snp_collection.tab with empty values in several columns:

snpID   freq_bin    ID_nearest_gene ID_nearest_gene_located_within  HGNC_nearest_gene_snpsnap   HGNC_nearest_gene_snpsnap_protein_coding    flag_snp_within_gene    flag_snp_within_gene_protein_coding
10:10001753      7  ENSG00000224788                 CELF2   FALSE   FALSE
10:10001794     41  ENSG00000224788                 CELF2   FALSE   FALSE
10:100023489    10  ENSG00000138131 ENSG00000138131 LOXL4   LOXL4   TRUE    TRUE
10:100025128    45  ENSG00000138131 ENSG00000138131 LOXL4   LOXL4   TRUE    TRUE
10:10002975      1  ENSG00000224788                 CELF2   FALSE   FALSE

Which for simplcity, looks like this:enter image description here

This is messing up things when I later parse it using awk to produce a more useful table for my purposes:

Later step which gets messed up from dirty input:

tail -n+2 broadSNPs.tab | gawk -F'[:\t]' -vOFS="\t" '{ print "chr"$1, ($2 - 1), $2, $3, $18, $19, $20, $21}' > broadSNPs.bed

Here, the empty spaces aren't respected by awk, and rows with empty spaces are offset in the output.

So, I'm trying to fill in those empty spaces before awk.

Using the advice given here, Filling space/tab separated, empty columns with 0, I've done:

sed 's/\t\t/\tNA\t/g' snp_collection.tab > broadSNPs.tab

I have Also tried replacing the \t's with ctrl+v+tab, but alas the output remains the same as i the original snp_collection.tab, with the empty spaces still just empty.

What am I missing? Thanks!

Desired output:

snpID     freq_bin  ID_nearest_gene ID_nearest_gene_located_within  HGNC_nearest_gene_snpsnap   HGNC_nearest_gene_snpsnap_protein_coding    flag_snp_within_gene    flag_snp_within_gene_protein_coding
10:10001753      7  ENSG00000224788             NA     NA   CELF2   FALSE   FALSE
10:10001794     41  ENSG00000224788             NA     NA   CELF2   FALSE   FALSE
10:100023489    10  ENSG00000138131 ENSG00000138131 LOXL4   LOXL4   TRUE    TRUE
10:100025128    45  ENSG00000138131 ENSG00000138131 LOXL4   LOXL4   TRUE    TRUE
10:10002975      1  ENSG00000224788             NA     NA   CELF2   FALSE   FALSE

With this fixed input, my awk command shouldn't produce a nonsensical output.

Carmen Sandoval
  • 2,266
  • 5
  • 30
  • 46
  • Please provide example file as text (not screenshot image), and expected output. Does it have to be awk/sed, maybe use R? – zx8754 May 03 '18 at 20:05
  • Just edited to provide a text example. The actual table is ~9M rows of data, so I'd like to avoid loading it into R. Thanks! – Carmen Sandoval May 03 '18 at 22:59

1 Answers1

2

Simplest solution may be to tell awk that single tab (\t) is field separator:

Default behavior is for awk to treat any consecutive white-space as field separator:

echo -e "A\tB\t\tC\tD" | awk '{ print "c1="$1, "c2="$2, "c3="$3, "c4="$4, "c5="$5 }'
c1=A c2=B c3=C c4=D c5=

Note how awk treats the two consecutive tabs as single separator ie, we want C char to be 4th column, but it shows up as 3rd column.

Specifying single tab as separator results in column 3 being empty:

echo -e "A\tB\t\tC\tD" | awk -F'\t' '{ print "c1="$1, "c2="$2, "c3="$3, "c4="$4, "c5="$5 }'
c1=A c2=B c3= c4=C c5=D

For more information see:

https://stackoverflow.com/a/30406868/1515117

Vince
  • 3,325
  • 2
  • 23
  • 41