2

I want to remove the last character of every line that begins with @ in more than 300 files, each about 1 GB in size.

My example file is as follows:

@1_1101_1473_2134_1
CATGCGGGAGGAGGAGGACGAGGACCTGCTGCAGTTTGCCATCCAGCAGAGTCTCCTGGAGGTGGGGGCCGAGTACGACCAGGTAACACCCC
+
FFFFFFFFFFFFFFFFFFFFBFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFBBFFFFF<FFFFFF/BFBF7FFBFFFFFFFFFFBFFFFFF
@1_1101_1635_2243_1
CATGCACACCTCCCGGTCTCCGTTGTGGAGGATCAGGTCCACGATCTCCTGGGTCCACGTGGTGCCTACACACACACACACACACACACACA
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

And I want to remove the last character 1 from the lines that start with @ so my output should be

@1_1101_1473_2134_
CATGCGGGAGGAGGAGGACGAGGACCTGCTGCAGTTTGCCATCCAGCAGAGTCTCCTGGAGGTGGGGGCCGAGTACGACCAGGTAACACCCC
+
FFFFFFFFFFFFFFFFFFFFBFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFBBFFFFF<FFFFFF/BFBF7FFBFFFFFFFFFFBFFFFFF
@1_1101_1635_2243_
CATGCACACCTCCCGGTCTCCGTTGTGGAGGATCAGGTCCACGATCTCCTGGGTCCACGTGGTGCCTACACACACACACACACACACACACA
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

I first tried Python, which worked for these lines, but as a newbie I couldn't figure out how to keep all the other lines in the output.

with open("file.fq") as f:
        for line in f:
                length=(len(line)-2)
                if line.startswith('@'):
                        line=line[:length]+''+line[length+1:]
                        print(line)

This of course prints only the lines that start with @, but I wanted to show that it works:

@1_1101_1473_2134_

@1_1101_1635_2243_

Then I tried awk and sed. I can select the lines that start with @ using awk as follows:

awk '{if (/^@/)}'

And I can remove the last characters of each line with sed as:

sed {'s/.$//'}

So I tried of course combining these two, simply as:

awk '{if (/^@/)}' | sed {'s/.$//'} file.fq

Which does not work.

By the way, if possible, I would prefer to delete these characters directly in my files instead of creating new files, as I have over 300 GB of data, and naturally I would prefer a fast way of doing it.

Any help improving my commands, or any alternative approach, is highly appreciated. I will also want to run the correct command in a loop over all the files, which is why I first tried to write a Python script, so any help with the loop for your solution would also be great.

Many Thanks

FatihSarigol
  • 1
    Your only mistake in Python was to indent the `print()` to be part of the `if` statement. *Unindent* that line to be at the same level as the rest of the code in the `for` loop. – Martijn Pieters Nov 06 '16 at 15:50
  • 1
    You cannot do it without creating a new file unless you use `ed` and even then you'll use a buffer the size of the file so it doesn't make any difference. sed -i, etc. all create tmp files on the fly. – Ed Morton Nov 06 '16 at 15:54

4 Answers

4
$ sed -i '/^@/ s/.$//' file.fq
$ cat file.fq
@1_1101_1473_2134_
CATGCGGGAGGAGGAGGACGAGGACCTGCTGCAGTTTGCCATCCAGCAGAGTCTCCTGGAGGTGGGGGCCGAGTACGACCAGGTAACACCCC
+
FFFFFFFFFFFFFFFFFFFFBFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFBBFFFFF<FFFFFF/BFBF7FFBFFFFFFFFFFBFFFFFF
@1_1101_1635_2243_
CATGCACACCTCCCGGTCTCCGTTGTGGAGGATCAGGTCCACGATCTCCTGGGTCCACGTGGTGCCTACACACACACACACACACACACACA
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
  • /^@/ match lines starting with @
  • s/.$// delete the last character of such lines
  • -i edits the file in place; the syntax of -i differs between sed implementations (BSD/macOS sed, for example, requires a backup-suffix argument such as -i ''), so check your sed's documentation


With python

import fileinput

with fileinput.input(inplace=True) as f:
    for line in f:
        line = line.rstrip('\n')

        if line.startswith('@'):
            line = line[:-1]

        print(line)
  • This accepts file names as command-line arguments, so you can run something like ./del_last.py *.fq
  • See also Python's slice notation
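
For many large files, the same logic can also be written as an explicit helper that streams each file into a temporary file and then replaces the original, which is roughly what in-place editing does behind the scenes. This is only a minimal sketch: the names `trim_header_line` and `trim_file` are hypothetical, and it assumes there is enough free disk space for one temporary copy at a time.

```python
import os
import tempfile


def trim_header_line(line):
    """Drop the last character (before the line ending) of a line starting with '@'."""
    if not line.startswith('@'):
        return line
    body = line.rstrip('\r\n')
    ending = line[len(body):]
    # Assumption: a bare "@" line is left unchanged rather than emptied.
    return (body[:-1] if len(body) > 1 else body) + ending


def trim_file(path):
    """Stream `path` line by line into a temp file, then replace the original."""
    directory = os.path.dirname(os.path.abspath(path))
    # newline='' keeps the file's own line endings untouched.
    with open(path, newline='') as src, tempfile.NamedTemporaryFile(
            'w', dir=directory, delete=False, newline='') as dst:
        for line in src:
            dst.write(trim_header_line(line))
    # The temp file is in the same directory, so the rename stays on one filesystem.
    os.replace(dst.name, path)
```

To process a whole directory, one could call `trim_file(name)` for each `name` returned by `glob.glob('*.fq')`. Only one line is held in memory at a time, so file size should not matter.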
Sundeep
  • 1
    Amazing one-liner! Thank you so much! I was so close! :D I did try to make sed find the lines or awk remove the characters, but couldn't. If anyone comes to this post and wants to do it for many files, you can do it on the command line for, say, all the files in the directory by using a wildcard instead of the file name: `sed -i '/^@/ s/.$//' *` Cheers – FatihSarigol Nov 06 '16 at 16:01
0

For your Python script you just need to get the print statement out of the conditional suite:

with open("file.fq") as f:
    for line in f:
        if line.startswith('@'):
            line = line[:-2] + '\n'
        print(line, end = '')

If you have enough memory to hold a complete file and a copy you could use a regular expression and make the change to the whole file at once.

import re
pattern = r'^(@.*?)\S\r?\n'  # raw string, so \S reaches the regex engine unchanged
rex = re.compile(pattern, flags = re.MULTILINE)
with open("file.fq") as f:
    data = f.read()
new = rex.sub(r'\1\n', data)
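
To check the substitution on a small sample before running it on a whole file, a similar pattern can be applied to an in-memory string. This is a sketch rather than the exact pattern above: `trim_headers` is a hypothetical helper name, it removes the last character of the line whatever it is (not only non-whitespace), and it keeps the original line ending instead of normalising it to `\n`.

```python
import re

# A line starting with '@': capture everything up to the last character,
# drop that character, and keep the line ending (\r\n or \n) as it was.
HEADER = re.compile(r'^(@[^\r\n]*)[^\r\n](\r?\n)', flags=re.MULTILINE)


def trim_headers(text):
    """Return text with the last character removed from each '@' line."""
    return HEADER.sub(r'\1\2', text)


sample = (
    '@1_1101_1473_2134_1\n'
    'CATG\n'
    '+\n'
    'FFFF\n'
)
print(trim_headers(sample), end='')
```

The result still has to be written back to disk, for example with `open(path, 'w', newline='')`, and at about 1 GB per file this approach needs memory for both the original text and the modified copy. A final line without a trailing newline is not matched by this pattern.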
wwii
  • wouldn't you have to also strip the newline first? – Sundeep Nov 06 '16 at 16:11
  • @Sundeep - if OP is only concerned with printing to achieve the desired output format then, yes, the print statement newline needs to be accounted for - see edit. If you strip the whitespace then write back to a file, you will lose some structure. – wwii Nov 06 '16 at 16:22
  • as far as I know and tested it out, `line[:-1]` would remove the newline character if the newline character of `line` is not removed first :) – Sundeep Nov 06 '16 at 16:26
  • @Sundeep - you are right, my bad. I'll remove it unless you want to downvote. – wwii Nov 06 '16 at 16:31
  • I would suggest to just edit with relevant details :) – Sundeep Nov 06 '16 at 16:33
0

This should work:

sed 's/\(^@.*\)./\1/' <file>
Serjik
rvxtrm
-1

Is the number of lines from one @ line to the next always the same? Is it 4 throughout all of your files?

@1_1101_1473_2134_1
CATGCGGGAGGAGGAGGACGAGGACCTGCTGCAGTTTGCCATCCAGCAGAGTCTCCTGGAGGTGGGGGCCGAGTACGACCAGGTAACACCCC
+
FFFFFFFFFFFFFFFFFFFFBFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFBBFFFFF<FFFFFF/BFBF7FFBFFFFFFFFFFBFFFFFF
@1_1101_1635_2243_1
CATGCACACCTCCCGGTCTCCGTTGTGGAGGATCAGGTCCACGATCTCCTGGGTCCACGTGGTGCCTACACACACACACACACACACACACA
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

If so, this may help: find the first @ line and remove its last character, then move 4 lines ahead to the next @ line, remove its last character, and so on.
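
The position-based idea can be sketched in Python. This assumes every record is exactly four lines (header, sequence, +, quality) with no blank lines in between; `trim_every_fourth` is a hypothetical name. Selecting by position rather than by a leading @ also avoids touching quality lines, which in FASTQ may themselves begin with @.

```python
import io


def trim_every_fourth(src, dst):
    """Copy src to dst, dropping the last character of lines 1, 5, 9, ..."""
    for index, line in enumerate(src):
        if index % 4 == 0:
            body = line.rstrip('\r\n')
            # Remove the last character but keep the original line ending.
            line = body[:-1] + line[len(body):]
        dst.write(line)


# Small in-memory demonstration; real use would pass open file objects.
# The first quality line deliberately starts with '@' and must stay intact.
src = io.StringIO('@r1_1\nACGT\n+\n@FFF\n@r2_1\nTTGA\n+\nFFFF\n')
dst = io.StringIO()
trim_every_fourth(src, dst)
print(dst.getvalue(), end='')
```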

Mamed
  • Thanks for your comment. (I didn't vote it down, by the way.) I also thought about your idea, and that's true for my file: the @ lines are on every 4th line. I can add characters, for example, to the end of every second line using `awk '{if (NR%2==0) {$0=$0 "newcharacter"}; print}' file`, but I couldn't work out how to remove the last character from every 4th line. Thank you! – FatihSarigol Nov 06 '16 at 16:08