
I have a text file that looks like this:

>long_name
AAC-TGA
>long_name2
CCTGGAA

I also have a list of column numbers: 2, 4, 7. I can of course hold these in a variable, like:

cols="2 4 7"

I need to replace the character at each of those columns, in every row that doesn't start with >, with a single character, e.g. an N, to give:

>long_name
ANCNTGN
>long_name2
CNTNGAN

Additional details: the file has ~200K lines. All lines that don't start with > are the same length. Column indices will never exceed the length of the non-> lines.

It seems to me that some combination of sed and awk must be able to do this quickly, but I cannot for the life of me figure out how to link it all together.

E.g. I can use sed to work on all lines that don't start with a > like this (in this case replacing all spaces with N's):

sed -i.bak '/^[^>]/s/ /N/g' input.txt

And I can use AWK to replace specific columns of lines as I want to like this (I think...):

awk '$2=N'

But I am struggling to stitch this together.

oguz ismail
roblanf

3 Answers


With GNU awk, set the input and output field separators to the empty string so that each character becomes its own field; the listed columns can then be assigned directly.

awk -v cols='2 4 7' '
BEGIN {
  split(cols,f)
  FS=OFS=""
}
!/^>/ {
  for (i in f)
    $(f[i])="N"
}
1' file

Also see Save modifications in place with awk.
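For reference, one common way to get the effect of an in-place edit is to write to a temporary file and move it over the original. The following is a minimal sketch of that pattern around the program above; the scratch directory and the `file.tmp` name are assumptions made for illustration. GNU awk 4.1 and later also provide an inplace extension (`gawk -i inplace ...`), which is an alternative if that version is available.

```shell
# Sample input in a scratch directory (hypothetical path, for illustration).
dir=$(mktemp -d)
printf '>long_name\nAAC-TGA\n>long_name2\nCCTGGAA\n' > "$dir/file"

# Write the edited copy to a temporary file and replace the original
# only if awk exited successfully.
awk -v cols='2 4 7' '
BEGIN {
  split(cols,f)
  FS=OFS=""
}
!/^>/ {
  for (i in f)
    $(f[i])="N"
}
1' "$dir/file" > "$dir/file.tmp" && mv "$dir/file.tmp" "$dir/file"

# Show the updated file, then clean up the scratch directory.
result=$(cat "$dir/file")
printf '%s\n' "$result"
rm -r "$dir"
```

Because of the `&&`, the original is left untouched if awk fails partway through.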

oguz ismail

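You can generate the list of replacement commands first and then pass them to sed. Because each substitution replaces exactly one character with one character, the line length never changes, so s/./N/k always hits column k: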

$ printf '2 4 7' | sed -E 's|[0-9]+|/^>/! s/./N/&\n|g'
/^>/! s/./N/2
 /^>/! s/./N/4
 /^>/! s/./N/7
$ printf '2, 4, 7' | sed -E 's|[^0-9]*([0-9]+)[^0-9]*|/^>/! s/./N/\1\n|g'
/^>/! s/./N/2
/^>/! s/./N/4
/^>/! s/./N/7

$ sed -f <(printf '2 4 7' | sed -E 's|[0-9]+|/^>/! s/./N/&\n|g') ip.txt
>long_name
ANCNTGN
>long_name2
CNTNGAN


You can also use {} grouping, so the /^>/! address is written only once:

$ printf '2 4 7' | sed -E 's|^|/^>/!{|; s|[0-9]+|s/./N/&; |g; s|$|}|'
/^>/!{s/./N/2;  s/./N/4;  s/./N/7; } 
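The grouped script can be applied to the file the same way as the first version. As a sketch that avoids process substitution (so it should also run in a plain POSIX sh), the generated script can be captured in a variable and passed with `-e`; the `script` and `result` variable names and the scratch directory are assumptions for illustration.

```shell
# Sample input in a scratch directory (hypothetical path, for illustration).
dir=$(mktemp -d)
printf '>long_name\nAAC-TGA\n>long_name2\nCCTGGAA\n' > "$dir/ip.txt"

# Build the grouped sed script once from the column list...
script=$(printf '2 4 7' | sed -E 's|^|/^>/!{|; s|[0-9]+|s/./N/&; |g; s|$|}|')

# ...then hand it to a second sed with -e.
result=$(sed -e "$script" "$dir/ip.txt")
printf '%s\n' "$result"
rm -r "$dir"
```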
Sundeep
    You can mix sed's `-e` and `-f` options one or more times, and `-f` accepts `-` for stdin, so `<<<'2 4 7' sed -E 's#\S+#s/./N/&\n#g' | sed -e '/^>/b' -f - file` will achieve the same result. – potong Jun 12 '20 at 11:46

Using any awk in any shell on every UNIX box:

$ awk -v cols='2 4 7' '
    BEGIN { split(cols,c) }
    !/^>/ { for (i in c) $0=substr($0,1,c[i]-1) "N" substr($0,c[i]+1) }
1' file
>long_name
ANCNTGN
>long_name2
CNTNGAN
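If the column list arrives comma-separated, as in the question's "2, 4, 7", one possible variation is to split on any run of non-digits. This is a sketch under that assumption, not part of the original answer; the scratch directory and the `result` variable are for illustration only.

```shell
# Sample input in a scratch directory (hypothetical path, for illustration).
dir=$(mktemp -d)
printf '>long_name\nAAC-TGA\n>long_name2\nCCTGGAA\n' > "$dir/file"

result=$(awk -v cols='2, 4, 7' '
    # Split on any run of non-digits so "2 4 7" and "2, 4, 7" both work.
    BEGIN { n = split(cols, c, "[^0-9]+") }
    !/^>/ {
        for (i = 1; i <= n; i++)
            if (c[i] != "")   # skip empty pieces from leading or trailing separators
                $0 = substr($0, 1, c[i]-1) "N" substr($0, c[i]+1)
    }
1' "$dir/file")
printf '%s\n' "$result"
rm -r "$dir"
```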
Ed Morton