
I'm tasked with imputing masked genotypes, and I have to mask (hide) 2% of genotypes.

The file I do this in looks like this (genotype.dat):

M rs4911642
M rs9604821
M rs9605903
M rs5746647
M rs5747968
M rs5747999
M rs2070501
M rs11089263
M rs2096537

and to mask it, I simply change M to S2.

Yet I have to do this for 110 (2%) of 5505 lines, so my strategy of using a random number generator (generating 110 numbers between 1 and 5505) and then manually changing the M to S2 on each corresponding line took almost an hour... (I know, not terribly sophisticated).

I thought about saving the numbers in a separate file (maskedlines.txt) and then telling awk to replace the first character on each of those line numbers with S2, but I could not find an example of how to do this that I could adapt.

Anyway, any suggestions of how to tackle this will be deeply appreciated.

Cyrus

2 Answers

awk 'NR==FNR{a[$1]=1;next;} a[FNR]{$1="S2"} 1' maskedlines.txt genotype.dat

How it works

In sum, we first read maskedlines.txt into an associative array a. This file is assumed to have one line number per line, and the element of a indexed by that number is set to 1. We then read genotype.dat. If the element of a for the current line number is 1, we change the first field to S2 to mask it. The line, whether changed or not, is then printed.

In detail:

  • NR==FNR{a[$1]=1;next;}

    In awk, FNR is the number of records (lines) read so far from the current file and NR is the total number of records read so far across all files. So, when NR==FNR, we are reading the first file (maskedlines.txt). This file contains the line numbers of the lines in genotype.dat that are to be masked. For each of these line numbers, we set the matching element of a to 1. We then skip the rest of the commands and jump to the next line.

  • a[FNR]{$1="S2"}

    If we get here, we are working on the second file: genotype.dat. For each line in this file, we check to see if its line number, FNR, was mentioned in maskedlines.txt. If it was, we set the first field to S2 to mask this line.

  • 1

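    This is awk's cryptic shorthand to print the current line: 1 is a condition that is always true, and a condition with no action attached triggers the default action, which is print.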
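To see the three pieces working together, here is a small sketch run on toy data. The scratch directory, the four-line genotype.dat and the two-line maskedlines.txt are made up for illustration; only the awk program itself is the one from the answer.

```shell
# Toy stand-ins for the real files, written to a scratch directory (illustrative only).
tmp=$(mktemp -d)
printf 'M rs4911642\nM rs9604821\nM rs9605903\nM rs5746647\n' > "$tmp/genotype.dat"
printf '2\n4\n' > "$tmp/maskedlines.txt"   # mask lines 2 and 4

# Same awk program as above; the result goes to a new file.
awk 'NR==FNR{a[$1]=1;next;} a[FNR]{$1="S2"} 1' \
    "$tmp/maskedlines.txt" "$tmp/genotype.dat" > "$tmp/maskedgenotype.dat"

cat "$tmp/maskedgenotype.dat"
# M rs4911642
# S2 rs9604821
# M rs9605903
# S2 rs5746647
```

Note that assigning to $1 makes awk rebuild the line with single spaces between fields, which is harmless for a two-column, space-separated file like this one.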

John1024
  • that worked perfectly; many thanks. Apparently I need to generate a new file by adding > maskedgenotype.dat and then mv genotype.dat but that is fine. Also thanks for a very useful explanation. – Isidor Lipsch Mar 09 '15 at 22:01
  • @user3100623 Glad it worked. And, yes, create `maskedgenotype.dat` and then `mv`. If you have a very recent version of GNU awk (4.1.0 or later), then there is a built-in shortcut for this: see http://stackoverflow.com/a/16529730/3030305 – John1024 Mar 09 '15 at 22:09

Here's one simple way, if you have shuf (it's in GNU coreutils, so if you have Linux, you almost certainly have it):

sed "$(printf '%ds/M/S2/;' $(shuf -n110 -i1-5505 | sort -n))" \
    genotype.dat > genotype.masked

A more sophisticated version wouldn't depend on knowing that you want 110 of 5505 lines masked; you can easily extract the line count with lines=$(wc -l < genotype.dat), and from there you can compute the percentage.
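A rough sketch of that more general version might look like the following. The variable names percent and n, the rounding rule, and the 200-line toy file are my own assumptions for illustration; it also assumes the sample size works out to at least 1, since an empty shuf result would leave printf with nothing to format.

```shell
# Toy input: 200 lines in the same "M rsNNN" shape (illustrative only).
tmp=$(mktemp -d)
for i in $(seq 1 200); do echo "M rs$i"; done > "$tmp/genotype.dat"

lines=$(wc -l < "$tmp/genotype.dat")      # total number of lines
percent=2                                  # share of lines to mask (assumed name)
n=$(( (lines * percent + 50) / 100 ))      # sample size, rounded to nearest integer

# Same sed approach as above, with the counts derived from the file itself.
sed "$(printf '%ds/M/S2/;' $(shuf -n"$n" -i1-"$lines" | sort -n))" \
    "$tmp/genotype.dat" > "$tmp/genotype.masked"

masked=$(grep -c '^S2 ' "$tmp/genotype.masked")
echo "masked $masked of $lines lines"
```

With the 5505-line file from the question, the same arithmetic gives n=110.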

shuf is used to produce a random sample of lines, usually from a file; the -i1-5505 option means to use the integers from 1 to 5505 instead, and -n110 means to produce a random sample of 110 (without repetition). I sorted that for efficiency before using printf to create a sed edit script.
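To make the printf step concrete, here is a small sketch with fixed line numbers (2, 5 and 7) standing in for the shuf output so the result is reproducible. The variable name script and the eight-line toy file are made up for illustration. printf reuses its format string once per argument, which is what turns a list of numbers into a sed script.

```shell
# Fixed line numbers replace the random sample here (illustrative only).
script=$(printf '%ds/M/S2/;' 2 5 7)
echo "$script"
# prints: 2s/M/S2/;5s/M/S2/;7s/M/S2/;

# Apply the generated script to an eight-line toy file.
tmp=$(mktemp -d)
printf 'M rs%d\n' 1 2 3 4 5 6 7 8 > "$tmp/genotype.dat"
sed "$script" "$tmp/genotype.dat" > "$tmp/genotype.masked"

# Show which line numbers ended up masked.
grep -n '^S2 ' "$tmp/genotype.masked"
```

Each Ns/M/S2/ command restricts the substitution to line N, so only the listed lines are changed.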

rici
  • I see, so this would save me the step of generating random numbers externally... can I ask what does "%ds" stand for? s is substitute I'm guessing, but the d? – Isidor Lipsch Mar 09 '15 at 22:06
  • @IsidorLipsch It's a printf format. %d is an integer, and s is a literal s (as in the sed `s` command). If you prefer an external list, you can use that instead of `shuf`, but `shuf` is really handy for sampling. – rici Mar 09 '15 at 22:17