
I want to replace a specific character "M" in a line of text with either "A" or "T". The choice for whether to replace with "A" or "T" should happen at random for each "M" in the line of text.

I tried to write a script using sed to do this, but the evaluation of the random pick of "A" or "T" happens only once on the whole line, rather than at every replacement. My script looks like this:

#!/bin/bash

ambM[0]=A
ambM[1]=T

file_in=${1?Error: no input file}

cat $file_in | sed "s/M/${ambM[$[$RANDOM % 2]]}/g"

But if I use this with a file that is a single line of "M"s:

MMMM

I'll get either all "A"s

AAAA

Or all "T"s

TTTT

Is there something that can be done to make this work with sed? Or maybe an equivalent way to do this with awk? Thanks for any help!

3 Answers


awk to the rescue!

$ echo MMMMMMMMM | awk 'BEGIN {srand()} 
                              {do x=(rand()<0.5?"A":"T"); 
                               while (sub("M",x))}1' 

TTTAATTTT

More generally, for any number of replacement characters specified in the variable r:

$ ... | awk -v r='A T C G' 'BEGIN{n=split(r,c); srand()} 
                                 {do x=c[int(rand()*n)+1];
                                  while (sub("M",x))}1' 

Note that the replacements will not be split evenly among the characters, especially when there are only a few M's to replace. If you need an equal number of each replacement character, assign them deterministically rather than at random.
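
If you want to reuse this from a script, here is a minimal sketch that wraps the same idea in a shell function. The name random_sub and its three arguments (target character, space-separated replacement list, optional seed) are my own choices for illustration, not part of the answer above. It walks the line character by character instead of calling sub() in a loop, and an explicit seed should make a run repeatable with the same awk; without one, srand() seeds from the clock.

```shell
#!/bin/bash
# Hypothetical helper (name and arguments are assumptions for illustration).
# usage: random_sub CHAR 'A T' [SEED] < input
random_sub() {
    awk -v target="$1" -v r="$2" -v seed="${3:-}" '
        BEGIN {
            n = split(r, c, " ")            # c[1..n] holds the replacement characters
            if (seed != "") srand(seed)     # explicit seed: repeatable output
            else srand()                    # no seed: seed from the time of day
        }
        {
            out = ""
            len = length($0)
            for (i = 1; i <= len; i++) {
                ch = substr($0, i, 1)
                # a fresh random pick for every occurrence of the target
                out = out (ch == target ? c[int(rand() * n) + 1] : ch)
            }
            print out
        }'
}
```

For example, `random_sub M 'A T' < file` replaces each M independently, and `random_sub M 'A T C G' 42 < file` uses four categories with a fixed seed.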

karakfa
  • Thanks so much for this! Can this be expanded for three or four categories? For example, can you specify in the `x=(rand()<0.5?` more than two categories, by saying something like `<0.33` or `>=0.33 and <0.66` or `>=0.66`? – Vincenzo Ellis Oct 29 '18 at 15:31
  • That way would get messy; I added a generic version. – karakfa Oct 29 '18 at 15:42

This might work for you (GNU sed & shuf):

sed '/M/!b;h;x;s/./A\nT\n/g;s/.*/echo "&"|shuf/e;s/\n//g;x;G;:a;s/M\(.*\n\)\(.\)/\2\1/;ta;P;d' file

If the line does not contain an M, it is printed unchanged. Otherwise, the line is copied and the copy is converted into a shuffled string of A's and T's (one of each per character of the line, so there are always enough). This string is appended to the current line, and each M is replaced in turn by the character at the head of the string until no M's remain. Finally, the first line of the pattern space is printed and the leftover random characters are discarded.
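
If the one-liner is hard to follow, the same pool idea can be spelled out in plain bash. This is only a sketch of the steps described above, not a translation of the sed script: the function name pool_sub is hypothetical, it needs bash (for the substring expansions) and shuf from GNU coreutils, and it will be much slower than sed or awk on large files.

```shell
#!/bin/bash
# Hypothetical helper (name is an assumption): build a shuffled pool of
# letters per line, then let each M consume the next letter from the pool.
pool_sub() {
    local line pool out ch i
    while IFS= read -r line || [ -n "$line" ]; do
        # One A and one T per character of the line, shuffled into one string
        pool=$(for (( i = 0; i < ${#line}; i++ )); do printf 'A\nT\n'; done | shuf | tr -d '\n')
        out=""
        for (( i = 0; i < ${#line}; i++ )); do
            ch=${line:i:1}
            if [[ $ch == M ]]; then
                out+=${pool:0:1}    # take the head of the pool
                pool=${pool:1}      # and drop it so the next M gets a new letter
            else
                out+=$ch
            fi
        done
        printf '%s\n' "$out"
    done
}
```

Usage would be `pool_sub < file`; lines without an M pass through unchanged.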

potong

As long as it's single characters, you could use tr with a really long randomized target string.

tr M AAATTATAAATTTTATTTAAAT... <inputfile

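The intent is for tr to cycle through the target string, so that the first three M:s would be replaced by A, the next two by T, and so on, starting over when the string is exhausted. Be aware, though, that tr maps characters by their position in the two sets, not by occurrence in the input: with a single-character first set, only the first character of the target string is used, so every M ends up as the same letter. For reference, a long random target string can be generated like this: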

tr M $(dd if=/dev/urandom bs=65536 count=1 | tr '\000-\077' A | tr -c A T) <inputfile
tripleee