2

I want to remove all 'N's from data that looks like this:

>Seq1
NNNNNNNNA
NNNNNNNNN
ATCGGGGGG
NNNNNNNNN
GTCGGGGGG
>Seq2
GATAAAAAA
NNNNNNNNN

So that it returns:

>Seq1
AATCGGGGGGGTCGGGGGG
>Seq2
GATAAAAAA

But why doesn't this do it:

sed -e 's/N//g' 

What's the correct way to approach this?

neversaint

5 Answers

2

Here's my Perl solution:

perl -pe 'if (!/^>/) { tr/N\n//d } elsif ($. > 1) { $_ = "\n$_" }' input-file
Sean
1

Use:

sed ':a;N;$!ba;s/[N\n]//g'

[N\n] matches either an N or a newline. The rest, the :a;N;$!ba loop that reads the whole input into the pattern space before substituting, is taken from this question on Stack Overflow.
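
The loop matters because sed normally holds one line at a time in the pattern space, with the trailing newline already stripped, so \n has nothing to match without it. A quick check, assuming GNU sed (which accepts \n inside a bracket expression and semicolons after labels) and a shortened copy of the sample data:

```shell
# Shortened sample input, written to a temporary file.
tmp=$(mktemp)
printf '%s\n' '>Seq1' 'NNNNNNNNA' 'NNNNNNNNN' 'ATCGGGGGG' > "$tmp"

# Without the loop: each line is processed alone, so \n never matches
# and the line structure is kept (all-N lines become empty lines).
per_line=$(sed 's/[N\n]//g' "$tmp")

# With the loop: the whole file is in the pattern space, so Ns and
# newlines are removed together and everything collapses to one line.
slurped=$(sed ':a;N;$!ba;s/[N\n]//g' "$tmp")

printf 'per line:\n%s\n' "$per_line"
printf 'slurped:\n%s\n' "$slurped"
rm -f "$tmp"
```

Note that the slurped form also swallows the newlines around the header lines, so the headers end up fused with the sequence data.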

Ilion
  • Ah, I'm more familiar with Perl and didn't know that sed needs special handling to join the lines. +1 – Hachi Jan 19 '12 at 07:33
  • @Ilion: thanks, but that's not quite what I want. It puts everything on a single line, while I want to keep each '>Seq' line as a header. See the example. – neversaint Jan 19 '12 at 07:36
  • @neversaint: do you want the first line untouched or do you want special handling for any pattern in the header? – Hachi Jan 19 '12 at 07:41
1

This might work for you:

sed '/>Seq/{:a;x;s/N//g;s/\n//2gp;g;x;d};H;$ba;d' file
>Seq1
AATCGGGGGGGTCGGGGGG
>Seq2
GATAAAAAA

or this:

sed ':a;$!{N;ba};s/[N\n]//g;s/>Seq[0-9]*/\n&\n/g;s/.//' file
>Seq1
AATCGGGGGGGTCGGGGGG
>Seq2
GATAAAAAA
potong
1

A simple awk filter drops every line that starts with N:

awk '!/^N+/' filename

Test:

[jaypal:~/Temp] cat temp
>Seq1
NNNNNNNNA
NNNNNNNNN
ATCGGGGGG
NNNNNNNNN
GTCGGGGGG
>Seq2
GATAAAAAA
NNNNNNNNN

[jaypal:~/Temp] awk '!/^N+/' temp
>Seq1
ATCGGGGGG
GTCGGGGGG
>Seq2
GATAAAAAA
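
The filter above removes whole lines, so the trailing A of NNNNNNNNA is lost and the remaining sequence lines are not joined. A minimal sketch of an awk variant that strips the Ns, joins the sequence lines, and keeps each header on its own line; strip_n is a hypothetical helper name, and the script assumes every header starts with '>':

```shell
# strip_n [FILE...]: hypothetical helper, not part of the original answer.
strip_n() {
  awk '
    /^>/ {                      # header line
      if (seq != "") print seq  # flush the previous record, if any
      seq = ""
      print                     # keep the header unchanged
      next
    }
    {
      gsub(/N/, "")             # delete every N on this sequence line
      seq = seq $0              # append what is left to the current record
    }
    END { if (seq != "") print seq }
  ' "$@"
}
```

Run it as strip_n temp, or pipe data into it with no arguments.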
jaypal singh
0

You need '\n' to match the newline characters:

sed -e 's/[N\n]//g'

If this doesn't do what you want, please show us what it does and explain how that differs from what you want.

Hachi