Removing Characters and Deleting Empty Lines with sed/Perl

Question

I want to remove all 'N's from the data that looks like this:

>Seq1
NNNNNNNNA
NNNNNNNNN
ATCGGGGGG
NNNNNNNNN
GTCGGGGGG
>Seq2
GATAAAAAA
NNNNNNNNN

So that it returns:

>Seq1
AATCGGGGGGGTCGGGGGG
>Seq2
GATAAAAAA

But why doesn't this do it:

sed -e 's/N//g'

What's the correct way to approach this?

Do you want to remove the newlines as well, or only the Ns? – Raghuram Jan 19 '12 at 07:24

score 2 · Accepted Answer · answered Jan 19 '12 at 07:46

Here's my Perl solution:

perl -pe 'if (!/^>/) { tr/N\n//d } elsif ($. > 1) { $_ = "\n$_" }' input-file

answered Jan 19 '12 at 07:46 by Sean

You are missing a newline on the last line. – potong Jan 19 '12 at 20:04

score 1 · Answer 2 · edited May 23 '17 at 12:29

Use:

sed ':a;N;$!ba;s/[N\n]//g'

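[N\n] matches either an N or a newline. The :a;N;$!ba loop in front of it appends every remaining line to the pattern space, so the substitution sees the whole input at once; that part is taken from this question on StackOverflow.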
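To see what the command above does with the sample data, here is a small demonstration. It assumes GNU sed (other sed implementations may reject the semicolon-separated label syntax), and the function names `sample` and `slurp_strip` are hypothetical:

```shell
# sample: the input from the question (hypothetical helper for the demo).
sample() {
  printf '%s\n' '>Seq1' NNNNNNNNA NNNNNNNNN ATCGGGGGG NNNNNNNNN GTCGGGGGG \
                '>Seq2' GATAAAAAA NNNNNNNNN
}

# slurp_strip: the command from this answer, unchanged.
# :a;N;$!ba  loops until the last line, appending each line to the pattern space.
# s/[N\n]//g then deletes every N and every embedded newline.
slurp_strip() {
  sed ':a;N;$!ba;s/[N\n]//g' "$@"
}

sample | slurp_strip
# prints: >Seq1AATCGGGGGGGTCGGGGGG>Seq2GATAAAAAA
```

Because the header lines lose their newlines too, everything ends up on a single line.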

answered Jan 19 '12 at 07:30 by Ilion; edited May 23 '17 at 12:29 by Community

Ah, I'm more familiar with Perl and didn't know that sed needs special handling to join the lines. +1 – Hachi Jan 19 '12 at 07:33
@Ilion: thanks, but that's not quite what I want. It puts everything on a single line, whereas I want to keep each '>Seq' line as a header. See the example. – neversaint Jan 19 '12 at 07:36
@neversaint: do you want the first line untouched or do you want special handling for any pattern in the header? – Hachi Jan 19 '12 at 07:41

score 1 · Answer 3 · answered Jan 19 '12 at 09:40

This might work for you:

sed '/>Seq/{:a;x;s/N//g;s/\n//2gp;g;x;d};H;$ba;d' file
>Seq1
AATCGGGGGGGTCGGGGGG
>Seq2
GATAAAAAA

or this:

sed ':a;$!{N;ba};s/[N\n]//g;s/>Seq[0-9]*/\n&\n/g;s/.//' file
>Seq1
AATCGGGGGGGTCGGGGGG
>Seq2
GATAAAAAA

answered Jan 19 '12 at 09:40 by potong

score 1 · Answer 4 · edited Jun 20 '20 at 09:12

A simple awk filter drops every line that starts with N:

awk '!/^N+/' filename

Test:

[jaypal:~/Temp] cat temp
>Seq1
NNNNNNNNA
NNNNNNNNN
ATCGGGGGG
NNNNNNNNN
GTCGGGGGG
>Seq2
GATAAAAAA
NNNNNNNNN

[jaypal:~/Temp] awk '!/^N+/' temp
>Seq1
ATCGGGGGG
GTCGGGGGG
>Seq2
GATAAAAAA
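The filter above drops whole lines that begin with N, so the trailing A of NNNNNNNNA is lost and the remaining sequence lines stay separate. A minimal awk sketch that instead deletes only the N characters and joins each record's sequence lines, assuming every record starts with a `>` header; the function names `sample` and `strip_n_awk` are hypothetical:

```shell
# sample: the input from the question (hypothetical helper for the demo).
sample() {
  printf '%s\n' '>Seq1' NNNNNNNNA NNNNNNNNN ATCGGGGGG NNNNNNNNN GTCGGGGGG \
                '>Seq2' GATAAAAAA NNNNNNNNN
}

strip_n_awk() {
  awk '
    /^>/ {                       # header line starts a new record
      if (seq != "") print seq   # flush the previous joined sequence
      print                      # keep the header untouched
      seq = ""
      next
    }
    {
      gsub(/N/, "")              # delete every N on a sequence line
      seq = seq $0               # append what is left to the record
    }
    END { if (seq != "") print seq }   # flush the last record
  ' "$@"
}

sample | strip_n_awk
# prints:
# >Seq1
# AATCGGGGGGGTCGGGGGG
# >Seq2
# GATAAAAAA
```

A record whose sequence consists only of Ns produces just its header line in this sketch.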

answered Jan 19 '12 at 15:13 by jaypal singh; edited Jun 20 '20 at 09:12 by Community

score 0 · Answer 5 · answered Jan 19 '12 at 07:28

You need '\n' to match the newline characters:

sed -e 's/[N\n]//g'

If this doesn't do what you want, please show us what it does and explain how that differs from what you want.
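For reference, a short demonstration of what this command prints on the first few lines of the sample. sed reads one line at a time and strips the trailing newline before applying the script, so the `\n` in the bracket expression has nothing to match here; the function name `per_line_strip` is hypothetical:

```shell
# per_line_strip: the command from this answer, unchanged.
per_line_strip() {
  sed -e 's/[N\n]//g' "$@"
}

printf '%s\n' '>Seq1' NNNNNNNNA NNNNNNNNN ATCGGGGGG | per_line_strip
# prints:
# >Seq1
# A
#
# ATCGGGGGG
```

The Ns are removed, but each input line still produces one output line, which is why all-N lines turn into empty lines instead of disappearing.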

answered Jan 19 '12 at 07:28 by Hachi
