Remove Last Character of Each Line that Starts With @

Question

I want to remove the last character of every line that begins with @ in more than 300 files, each about 1 GB.

My example file is as follows:

@1_1101_1473_2134_1
CATGCGGGAGGAGGAGGACGAGGACCTGCTGCAGTTTGCCATCCAGCAGAGTCTCCTGGAGGTGGGGGCCGAGTACGACCAGGTAACACCCC
+
FFFFFFFFFFFFFFFFFFFFBFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFBBFFFFF<FFFFFF/BFBF7FFBFFFFFFFFFFBFFFFFF
@1_1101_1635_2243_1
CATGCACACCTCCCGGTCTCCGTTGTGGAGGATCAGGTCCACGATCTCCTGGGTCCACGTGGTGCCTACACACACACACACACACACACACA
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

And I want to remove the last character 1 from the lines that start with @ so my output should be

@1_1101_1473_2134_
CATGCGGGAGGAGGAGGACGAGGACCTGCTGCAGTTTGCCATCCAGCAGAGTCTCCTGGAGGTGGGGGCCGAGTACGACCAGGTAACACCCC
+
FFFFFFFFFFFFFFFFFFFFBFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFBBFFFFF<FFFFFF/BFBF7FFBFFFFFFFFFFBFFFFFF
@1_1101_1635_2243_
CATGCACACCTCCCGGTCTCCGTTGTGGAGGATCAGGTCCACGATCTCCTGGGTCCACGTGGTGCCTACACACACACACACACACACACACA
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

I first tried python, which worked for these lines, but as a newbie, I couldn't figure out how to retain all the lines in an output.

with open("file.fq") as f:
        for line in f:
                length=(len(line)-2)
                if line.startswith('@'):
                        line=line[:length]+''+line[length+1:]
                        print(line)

This of course prints only the lines that start with @, but I wanted to show that the removal itself works:

@1_1101_1473_2134_

@1_1101_1635_2243_

Then I tried awk and sed. I can select the lines that start with @ using awk as follows:

awk '{if (/^@/)}'

And I can remove the last characters of each line with sed as:

sed {'s/.$//'}

So I tried of course combining these two, simply as:

awk '{if (/^@/)}' | sed {'s/.$//'} file.fq

Which does not work.

By the way, if possible, I would prefer deleting these characters directly from my files instead of creating a new file with these characters deleted as I have over 300gb of data, and naturally I would prefer a fast way of doing it.

Any help to upgrade my commands, or any alternative way of doing it in any other way is highly appreciated. Also I will want to run the correct command in a loop for all the files, that's why I first tried to generate a python script, so any help about the loop stage for your solution would also be great.

Many Thanks

Your only mistake in Python was to indent the `print()` to be part of the `if` statement. *Unindent* that line to be at the same level as the rest of the code in the `for` loop. — Martijn Pieters, Nov 06 '16 at 15:50
You cannot do it without creating a new file unless you use `ed` and even then you'll use a buffer the size of the file so it doesn't make any difference. sed -i, etc. all create tmp files on the fly. — Ed Morton, Nov 06 '16 at 15:54

Sundeep · Accepted Answer · 2017-09-24T03:39:19.900

$ sed -i '/^@/ s/.$//' file.fq
$ cat file.fq
@1_1101_1473_2134_
CATGCGGGAGGAGGAGGACGAGGACCTGCTGCAGTTTGCCATCCAGCAGAGTCTCCTGGAGGTGGGGGCCGAGTACGACCAGGTAACACCCC
+
FFFFFFFFFFFFFFFFFFFFBFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFBBFFFFF<FFFFFF/BFBF7FFBFFFFFFFFFFBFFFFFF
@1_1101_1635_2243_
CATGCACACCTCCCGGTCTCCGTTGTGGAGGATCAGGTCCACGATCTCCTGGGTCCACGTGGTGCCTACACACACACACACACACACACACA
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

/^@/ matches lines starting with @
s/.$// deletes the last character of those lines
-i edits the file in place; its syntax varies between sed versions (GNU sed accepts a bare -i, while BSD/macOS sed expects a backup-suffix argument, e.g. -i ''), so check your sed's documentation

With python

import fileinput

with fileinput.input(inplace=True) as f:
    for line in f:
        line = line.rstrip('\n')

        if line.startswith('@'):
            line = line[:-1]

        print(line)

This accepts file names as command-line arguments, so you can run something like ./del_last.py *.fq
See also Python's slice notation
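Since both `sed -i` and `fileinput` with `inplace=True` rewrite the file through a temporary copy, the same idea can be spelled out explicitly. The following is only a minimal sketch, not part of the original answer: the function name `strip_header_suffix` is hypothetical, and it assumes the temporary file may be created in the same directory as the input.

```python
import os
import shutil
import tempfile


def strip_header_suffix(path):
    """Drop the last character of every line in `path` that starts with '@'."""
    directory = os.path.dirname(os.path.abspath(path))
    # Temporary file in the same directory, so the final rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        # Binary mode avoids decoding overhead and leaves line endings untouched.
        with os.fdopen(fd, "wb") as dst, open(path, "rb") as src:
            for line in src:
                if line.startswith(b"@"):
                    body = line.rstrip(b"\r\n")
                    # Remove one character, then put the original line ending back.
                    line = body[:-1] + line[len(body):]
                dst.write(line)
        shutil.copymode(path, tmp_path)  # keep the original permission bits
        os.replace(tmp_path, path)       # swap the rewritten file into place
    except BaseException:
        # Clean up the partial temporary file and leave the original untouched.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

It could be driven with a loop such as `for name in sys.argv[1:]: strip_header_suffix(name)`. Like `sed -i`, it temporarily needs free disk space for one extra copy of the file being processed.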

Amazing one-liner! Thank you so much! I was so close! :D I did try to make sed find the lines or awk remove the characters, but couldn't. If anyone comes to this post and wants to do this for many files, you can do it on the command line for, let's say, all the files in the directory by using a wildcard instead of the file name: `sed -i '/^@/ s/.$//' *` Cheers — FatihSarigol, Nov 06 '16 at 16:01

wwii · Answer 2 · 2016-11-07T04:37:48.500

For your Python script you just need to get the print statement out of the conditional suite:

with open("file.fq") as f:
    for line in f:
        if line.startswith('@'):
            line = line[:-2] + '\n'
        print(line, end = '')

If you have enough memory to hold a complete file and a copy you could use a regular expression and make the change to the whole file at once.

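import re

# Raw string, so the regex escapes (\S, \r, \n) reach the re module unchanged.
pattern = r'^(@.*?)\S\r?\n'
rex = re.compile(pattern, flags=re.MULTILINE)
with open("file.fq") as f:
    data = f.read()
# Drop the last non-whitespace character of every line that starts with '@'.
new = rex.sub(r'\1\n', data)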
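To see what this substitution does before running it on a 1 GB file, it can be tried on a small in-memory string. This is only an illustrative sketch: the shortened sequence and quality lines are made-up sample data, and the second header deliberately ends in `\r\n`.

```python
import re

# Same pattern as in the answer, written as a raw string.
rex = re.compile(r'^(@.*?)\S\r?\n', flags=re.MULTILINE)

# Shortened, made-up sample; the second header uses a Windows line ending.
sample = (
    "@1_1101_1473_2134_1\n"
    "CATG\n"
    "+\n"
    "FFFF\n"
    "@1_1101_1635_2243_1\r\n"
    "CATG\n"
)

new = rex.sub(r'\1\n', sample)
print(new, end='')
```

Two side effects are worth knowing: a `\r\n` ending on a matched header is rewritten as `\n`, and a header on the very last line without a trailing newline is left unchanged, because the pattern requires the newline.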

edited Nov 07 '16 at 04:37

answered Nov 06 '16 at 16:06

wwii

wouldn't you have to also strip the newline first? – Sundeep Nov 06 '16 at 16:11
@Sundeep - if OP is only concerned with printing to achieve the desired output format then, yes, the print statement newline needs to be accounted for - see edit. If you strip the whitespace then write back to a file, you will lose some structure. – wwii Nov 06 '16 at 16:22
as far as I know and tested it out, `line[:-1]` would remove the newline character if the newline character of `line` is not removed first :) – Sundeep Nov 06 '16 at 16:26
@Sundeep - you are right, my bad. I'll remove it unless you want to downvote. – wwii Nov 06 '16 at 16:31
I would suggest to just edit with relevant details :) – Sundeep Nov 06 '16 at 16:33

score 0 · Answer 3 · edited Nov 07 '16 at 16:15

This should work:

sed 's/\(^@.*\)./\1/' <file>

edited Nov 07 '16 at 16:15

Serjik

answered Nov 06 '16 at 16:11

rvxtrm

Doesn't your pattern need `$` to indicate the end of the line? – Jay Rajput Nov 07 '16 at 00:45
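On the `$` question: because `.*` is greedy, it already extends to the end of the line and gives back exactly one character for the final `.`, so the result is the same with or without the anchor. The sketch below illustrates this with Python's `re` module, which is a different engine from sed's; the patterns are a Python spelling of the sed expression, and the short sequence line is made-up sample data.

```python
import re

header = "@1_1101_1473_2134_1"
sequence = "CATGCGGG"

# Python spelling of the sed expression s/\(^@.*\)./\1/, without and with a trailing $.
without_anchor = re.compile(r'^(@.*).')
with_anchor = re.compile(r'^(@.*).$')

for line in (header, sequence):
    a = without_anchor.sub(r'\1', line)
    b = with_anchor.sub(r'\1', line)
    # Both variants strip the last character of the header and leave the sequence alone.
    print(a, b, a == b)
```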

score -1 · Answer 4 · answered Nov 06 '16 at 15:56

Is the number of lines from one @ line to the next always the same? Is it 4 throughout the whole file?

@1_1101_1473_2134_1
CATGCGGGAGGAGGAGGACGAGGACCTGCTGCAGTTTGCCATCCAGCAGAGTCTCCTGGAGGTGGGGGCCGAGTACGACCAGGTAACACCCC
+
FFFFFFFFFFFFFFFFFFFFBFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFBBFFFFF<FFFFFF/BFBF7FFBFFFFFFFFFFBFFFFFF
@1_1101_1635_2243_1
CATGCACACCTCCCGGTCTCCGTTGTGGAGGATCAGGTCCACGATCTCCTGGGTCCACGTGGTGCCTACACACACACACACACACACACACA
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

If so, this may be helpful: find the first @ line and remove its last character, then move 4 lines ahead to the next @ line, remove its last character, and so on.

Thanks for your comment. (I didn't vote it down, btw.) I also thought about your idea, and that's true for my file: they are on every 4th line. I can add characters, for example, to the end of every second line using `awk '{if (NR%2==0) {$0=$0 "newcharacter"}; print}' file`, but I couldn't formulate it to remove the last character from every 4th line. Thank you! — FatihSarigol, Nov 06 '16 at 16:08
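The line-counting idea can be sketched in Python as follows. This is only an illustration under the assumption confirmed above, namely that every record is exactly 4 lines with the @ header first; the function name `strip_every_fourth` and the read names are hypothetical. One reason to prefer counting over matching `^@` is that in FASTQ-style files `@` is also a valid quality character, so a quality line can begin with it.

```python
import io


def strip_every_fourth(src, dst, period=4):
    """Copy src to dst, dropping the last character of lines 1, 5, 9, ... (1-based)."""
    for index, line in enumerate(src):
        if index % period == 0:
            body = line.rstrip("\r\n")
            # Remove one character, then put the original line ending back.
            line = body[:-1] + line[len(body):]
        dst.write(line)


# Made-up sample: the quality line of the first record happens to start with '@'.
sample = io.StringIO(
    "@read_1\n"
    "CATG\n"
    "+\n"
    "@FFF\n"
    "@read_2\n"
    "GGCA\n"
    "+\n"
    "FFFF\n"
)
out = io.StringIO()
strip_every_fourth(sample, out)
print(out.getvalue(), end="")
```

Here the quality line `@FFF` is left intact, whereas a pattern that only tests for a leading @ would shorten it as well.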
