replace specific columns on lines not starting with specific character in a text file

Question

I have a text file that looks like this:

>long_name
AAC-TGA
>long_name2
CCTGGAA

And a list of column numbers: 2, 4, 7. Of course I can have these as a variable like:

cols="2 4 7"

I need to replace each of those columns, in the rows that don't start with >, with a single character, e.g. an N, to give:

>long_name
ANCNTGN
>long_name2
CNTNGAN

Additional details: the file has ~200K lines. All lines that don't start with > are the same length. The column indices will never exceed the length of the non-> lines.

It seems to me that some combination of sed and awk must be able to do this quickly, but I cannot for the life of me figure out how to link it all together.

E.g. I can use sed to work on all lines that don't start with a > like this (in this case replacing all spaces with N's):

sed -i.bak '/^[^>]/s/ /N/g' input.txt

And I can use AWK to replace specific columns of lines as I want to like this (I think...):

awk '$2=N'

But I am struggling to stitch this together

You never need sed when you're using awk. – Ed Morton, Jun 12 '20 at 15:21

oguz ismail · Accepted Answer · score 2 · 2020-06-12T06:53:18.647

With GNU awk, set the input and output field separators to the empty string so that each character becomes its own field; you can then assign to the fields at the listed positions.

awk -v cols='2 4 7' '
BEGIN {
  split(cols,f)
  FS=OFS=""
}
!/^>/ {
  for (i in f)
    $(f[i])="N"
}
1' file

Also see Save modifications in place with awk.
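
Since the question edits the file with `sed -i.bak`, here is a minimal sketch of writing the awk result back to the same file through a temporary file. The function name `mask_cols_inplace` is hypothetical, and the sketch assumes an awk that splits on an empty `FS` (GNU awk does, as noted above). Recent GNU awk versions also offer `-i inplace` as an alternative.

```shell
# Sketch only: mask_cols_inplace is a made-up helper name.
# Usage: mask_cols_inplace 'COLS' FILE
mask_cols_inplace() {
  cols=$1
  file=$2
  tmp=$(mktemp) || return 1
  # Write the masked copy to a temp file; overwrite the original
  # only if awk exited successfully.
  awk -v cols="$cols" '
    BEGIN { split(cols, f); FS = OFS = "" }
    !/^>/ { for (i in f) $(f[i]) = "N" }
    1' "$file" > "$tmp" && cat "$tmp" > "$file"
  status=$?
  rm -f "$tmp"
  return "$status"
}
```

For example, `mask_cols_inplace '2 4 7' input.txt` would rewrite input.txt with columns 2, 4 and 7 masked. Copying the result back with `cat` rather than `mv` keeps the original file's permissions.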

edited Jun 12 '20 at 06:53

answered Jun 12 '20 at 06:35

oguz ismail

score 1 · Answer 2 · answered Jun 12 '20 at 06:57

You can generate a list of replacement commands first and then pass them to sed:

$ printf '2 4 7' | sed -E 's|[0-9]+|/^>/! s/./N/&\n|g'
/^>/! s/./N/2
 /^>/! s/./N/4
 /^>/! s/./N/7
$ printf '2, 4, 7' | sed -E 's|[^0-9]*([0-9]+)[^0-9]*|/^>/! s/./N/\1\n|g'
/^>/! s/./N/2
/^>/! s/./N/4
/^>/! s/./N/7

$ sed -f <(printf '2 4 7' | sed -E 's|[0-9]+|/^>/! s/./N/&\n|g') ip.txt
>long_name
ANCNTGN
>long_name2
CNTNGAN

You can also use {} grouping, so the address is written only once:

$ printf '2 4 7' | sed -E 's|^|/^>/!{|; s|[0-9]+|s/./N/&; |g; s|$|}|'
/^>/!{s/./N/2;  s/./N/4;  s/./N/7; }
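
The grouped one-liner can be handed to sed directly as its script argument, which avoids process substitution. A minimal sketch, where the function name `mask_cols_sed` is hypothetical and the column list is assumed to be space-separated:

```shell
# Sketch only: mask_cols_sed is a made-up helper name.
# Usage: mask_cols_sed 'COLS' FILE
mask_cols_sed() {
  # Turn "2 4 7" into: /^>/!{s/./N/2;  s/./N/4;  s/./N/7; }
  script=$(printf '%s\n' "$1" |
    sed -E 's|^|/^>/!{|; s|[0-9]+|s/./N/&; |g; s|$|}|')
  # Apply the generated script; lines starting with > are skipped.
  sed "$script" "$2"
}
```

Each `s/./N/K` replaces the K-th character of the line. Because every replacement keeps the line length unchanged, the order of the columns does not matter.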

You can mix sed's `-e` and `-f` options any number of times, and `-f` accepts `-` to read the script from stdin, so `<<<'2 4 7' sed -E 's#\S+#s/./N/&\n#g' | sed -e '/^>/b' -f - file` achieves the same result. — potong, Jun 12 '20 at 11:46

score 0 · Answer 3 · answered Jun 12 '20 at 15:17

Using any awk in any shell on every UNIX box:

$ awk -v cols='2 4 7' '
    BEGIN { split(cols,c) }
    !/^>/ { for (i in c) $0=substr($0,1,c[i]-1) "N" substr($0,c[i]+1) }
1' file
>long_name
ANCNTGN
>long_name2
CNTNGAN
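
The question writes the list as "2, 4, 7". A variation on the same `substr` idea can split on any run of non-digits and skip positions that fall outside the line. This is a sketch under those assumptions rather than part of the answer above, and the name `mask_cols_portable` is hypothetical:

```shell
# Sketch only: mask_cols_portable is a made-up helper name.
# Usage: mask_cols_portable 'COLS' FILE   (COLS may be "2 4 7" or "2, 4, 7")
mask_cols_portable() {
  awk -v cols="$1" '
    # Split the list on any run of non-digit characters.
    BEGIN { n = split(cols, c, /[^0-9]+/) }
    !/^>/ {
      for (i = 1; i <= n; i++) {
        p = c[i] + 0
        # Ignore empty or out-of-range positions.
        if (p >= 1 && p <= length($0))
          $0 = substr($0, 1, p - 1) "N" substr($0, p + 1)
      }
    }
    1' "$2"
}
```

With the sample file, `mask_cols_portable '2, 4, 7' file` should print the same four lines as the output shown above.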

answered Jun 12 '20 at 15:17

Ed Morton
