3

I have some txt files that look like this (they contain DNA sequences and sample codes):

>SRR1502445.1
GACTACACGTAGTATACGAGTGCGTTCCTGCGCTTATTGATATGCTTAAGTTCAGCGGGTAGTCTCACCCGATTTGAGGTCAAGGTTTGTGGGTCGAGTCACAACTCGAACATCGTCTTTGTACAAAGACGGTTGGAAGCGGGTTCCAAGGCAACACAGGGGGATAGGNNNNNNNNNNNNNNNNNNNNNNN
>SRR1502445.2
GACTACACGTAGTATACGAGTGCGTTCCTGCGCTTATTGATATGCTTAAGTTCAGCGGGTAGTCTCACCCGATTTGAGGTCAAGGTTTGTGGGTCGAGTCACAACTCGAACATCGTCTTTGTACAAGACGGTTGGAAGCGGGTTCCAAGGCACACAGGGGATAGGNNN
>SRR1502445.3
GACTACACGTAGTATACGAGTGCGTTCCTGCGCTTATTGATATGCTTAAGTTCAGCGGGTAGTCTCACCCGATTTGAGGTCAAGGTTTGTGGGTCGAGTCACAACTCGAACATCGTCTTTGTACAAAGACGGTTGGAAGCGGGTTCCAAGGCACACAGGGGATAGGNNN
>SRR1502445.4
GACTACACGTAGTATACGAGTGCGTTCCTGCGCTTATTGATATGCTTAAGTTCAGCGGGTAGTCTCACCCGATTTGAGGTCAAGGTTTGTGGGTCGAGTCACAACTCGAACATCGTCTTTGTACAAAGACGGTTGGAAGCGGGTTCCAAGGCACACAGGGGATAGGNNNNNNNNNNN

I would like to remove the first 15 characters of every other line in the file. This would remove the string GACTACACGTAGTAT from the second, fourth, sixth, eighth lines (etc).

For instance the cut command can remove the first 15characters of every line:

cut -c 1-15 /path/to/file.txt

I'd like to apply to to only every other line, starting with the second.

colin
  • 2,606
  • 4
  • 27
  • 57
  • 2
    Please post the code for what you have tried, and what happens when you run your code. – Greg Hewgill May 13 '15 at 22:15
  • use Biopython or Bioperl or Biojava or BioETC for parse the falta, it 's two line of code ......... – Jose Ricardo Bustos M. May 13 '15 at 22:19
  • @JoseRicardoBustosM. i'd rather find a solution that doess not involve installing one of these packages as there is probably a solution using the base terminal commands. – colin May 13 '15 at 22:23
  • you need truncate, both fasta and qual files – Jose Ricardo Bustos M. May 13 '15 at 22:23
  • add tag `qiime`, i'm wrong then – Jose Ricardo Bustos M. May 13 '15 at 22:24
  • @JoseRicardoBustosM. ...i am aware. A solution to the problem above can be applied to both fasta and qual files. Also- the qiime command you linked cannot remove the first 15 base pairs, it can only cut it off after a certain number (like only include the first 100). – colin May 13 '15 at 22:24
  • I check this .... wait – Jose Ricardo Bustos M. May 13 '15 at 22:27
  • That keeps the first 15 bp and throws away everything else. From the site you linked- "all data at that base position and to the end of the sequence will be removed in the output filtered files." – colin May 13 '15 at 22:27
  • in the FASTA format , it isn't always one line of header and then one line of sequence , usually there are several lines of sequence, and OP needs only delete first 15 letter in several lines of sequence ..... improve the question, please – Jose Ricardo Bustos M. May 14 '15 at 00:57
  • 1
    @colin, will you please address the comment made by Jose Ricardo Bustos M. re: _"in the FASTA format , it isn't always one line of header and then one line of sequence ..."_ and is this applicable to your situation? Because if it's so than any solution that's coded to explicitly skip every other line can fail! – user3439894 May 14 '15 at 13:58
  • @user3439894 while there may be some situations where this in true, 99% of the time fasta (or .fna, DNA sequence files) have one line indicating the sample and the replicate read within sample, and then a second line immediately following with the corresponding DNA sequence. So, I wouldn't worry about that special case. Jose has been quick to propose several solutions that do not work like he thinks they do, and assert things that are not true in this thread. – colin May 14 '15 at 14:10
  • A link to a description of the FASTA format (as that appears to be the file format being discussed), would be appropriate, either added to the question, or to an answer. – ChuckCottrill May 15 '15 at 01:12

6 Answers6

5

If you don't mind using sedand assuming other line starts with > then the following will remove the first 15 contiguous uppercase characters "A-Z" of the other lines:

sed 's/^[A-Z]\{15\}//' file > new_file

Or, in place edit (GNU sed) use -i:

sed -i 's/^[A-Z]\{15\}//' file

Or, in place edit (BSD sed) use -i '':

sed -i '' 's/^[A-Z]\{15\}//' file

Or, back it up:

sed -i.bak 's/^[A-Z]\{15\}//' file

Example:

$ cat file
>SRR1502445.1
GACTACACGTAGTATACGAGTGCGTTCCTGCGCTTATTGATATGCTTAAGTTCAGCGGGTAGTCTCACCCGATTTGAGGTCAAGGTTTGTGGGTCGAGTCACAACTCGAACATCGTCTTTGTACAAAGACGGTTGGAAGCGGGTTCCAAGGCAACACAGGGGGATAGGNNNNNNNNNNNNNNNNNNNNNNN
>SRR1502445.2
GACTACACGTAGTATACGAGTGCGTTCCTGCGCTTATTGATATGCTTAAGTTCAGCGGGTAGTCTCACCCGATTTGAGGTCAAGGTTTGTGGGTCGAGTCACAACTCGAACATCGTCTTTGTACAAGACGGTTGGAAGCGGGTTCCAAGGCACACAGGGGATAGGNNN
>SRR1502445.3
GACTACACGTAGTATACGAGTGCGTTCCTGCGCTTATTGATATGCTTAAGTTCAGCGGGTAGTCTCACCCGATTTGAGGTCAAGGTTTGTGGGTCGAGTCACAACTCGAACATCGTCTTTGTACAAAGACGGTTGGAAGCGGGTTCCAAGGCACACAGGGGATAGGNNN
>SRR1502445.4
GACTACACGTAGTATACGAGTGCGTTCCTGCGCTTATTGATATGCTTAAGTTCAGCGGGTAGTCTCACCCGATTTGAGGTCAAGGTTTGTGGGTCGAGTCACAACTCGAACATCGTCTTTGTACAAAGACGGTTGGAAGCGGGTTCCAAGGCACACAGGGGATAGGNNNNNNNNNNN
$ sed 's/^[A-Z]\{15\}//' file
>SRR1502445.1
ACGAGTGCGTTCCTGCGCTTATTGATATGCTTAAGTTCAGCGGGTAGTCTCACCCGATTTGAGGTCAAGGTTTGTGGGTCGAGTCACAACTCGAACATCGTCTTTGTACAAAGACGGTTGGAAGCGGGTTCCAAGGCAACACAGGGGGATAGGNNNNNNNNNNNNNNNNNNNNNNN
>SRR1502445.2
ACGAGTGCGTTCCTGCGCTTATTGATATGCTTAAGTTCAGCGGGTAGTCTCACCCGATTTGAGGTCAAGGTTTGTGGGTCGAGTCACAACTCGAACATCGTCTTTGTACAAGACGGTTGGAAGCGGGTTCCAAGGCACACAGGGGATAGGNNN
>SRR1502445.3
ACGAGTGCGTTCCTGCGCTTATTGATATGCTTAAGTTCAGCGGGTAGTCTCACCCGATTTGAGGTCAAGGTTTGTGGGTCGAGTCACAACTCGAACATCGTCTTTGTACAAAGACGGTTGGAAGCGGGTTCCAAGGCACACAGGGGATAGGNNN
>SRR1502445.4
ACGAGTGCGTTCCTGCGCTTATTGATATGCTTAAGTTCAGCGGGTAGTCTCACCCGATTTGAGGTCAAGGTTTGTGGGTCGAGTCACAACTCGAACATCGTCTTTGTACAAAGACGGTTGGAAGCGGGTTCCAAGGCACACAGGGGATAGGNNNNNNNNNNN
$ 
user3439894
  • 7,266
  • 3
  • 17
  • 28
  • is there a way to do this so that I don't have to actually enter the sequence being removed, and instead it just removes the first 15 characters, regardless of what they are? Often what needs to be removed is the first 15 characters, but not necessarily `GACTACACGTAGTAT` – colin May 13 '15 at 23:17
  • @colin Are all the lines starting with `>` always less then 15 characters? If they are you can use the following: `sed 's/^.\{15\}//' file` – user3439894 May 13 '15 at 23:28
  • unfortunately the lines with the > at the beginning can get up at 17+ characters. – colin May 13 '15 at 23:35
  • 1
    @colin Assuming every other line starts with `>` then the following will remove the first 15 uppercase characters "A-Z" of the other lines: `sed 's/^[A-Z]\{15\}//' file` – user3439894 May 13 '15 at 23:47
  • in the FASTA format , it isn't always one line of header and then one line of sequence , usually there are several lines of sequence, and OP needs only delete first 15 letter in several lines of sequence – Jose Ricardo Bustos M. May 14 '15 at 00:56
  • @Jose Ricardo Bustos M., Other then to recognize a DNA sequence I'm not familiar with the FASTA format however If the assumption I've made, the non-DNA sequence lines always start with `>` then it doesn't matter as what I've proposed will only act on the DNA sequence lines as the `sed` command I've posted only removes the first 15 contiguous uppercase alpha characters from lines starting with an uppercase alpha character. So under that assumption and condition what I posted does what the OP asks. – user3439894 May 14 '15 at 01:16
  • Jose- my question is now wrong, and this is the correct answer to it! relax! – colin May 14 '15 at 15:56
  • @user3439894 I've tested this code and it does the job, thankyou! However, if I change 15 to the value 3, I get the following error: `sed: 1: "’s/^[A-Z]{3}//‘": invalid command code ?` – colin May 14 '15 at 16:14
  • 1
    @colin try: `sed 's/^[A-Z]\{3\}//' file` – user3439894 May 14 '15 at 16:18
  • @colin the format fasta show in question is the problem, this could cause problems for future readers ..... It works for you , but can cause problems for others – Jose Ricardo Bustos M. May 14 '15 at 16:59
4

You can try

sed '0~2s/^.\{15\}//g' filename

0~2 takes every 2nd line

^.\{15\}

looks for the first 15 characters

The sed command replaces them with nothing!

Community
  • 1
  • 1
Dipak
  • 59
  • 7
  • This generates the following error: `sed: 1: "0~2s/^.\{15\}//g": invalid command code ~` – colin May 14 '15 at 16:13
  • Which flavour of unix are you using? More over it could be a copy paste eror. Try using the Tilde symbol while writing the code! It worked perfectly fine for me! – Dipak May 14 '15 at 17:35
  • I'm using terminal on mac osx- just entered the code by hand into terminal, i still get the same error: `sed: 1: "0~2/^.\{15\}//g": invalid command code ~`, which is a bummer because your code seems like it would generalize to the most situations! – colin May 14 '15 at 17:44
  • 1
    @colin, OS X uses BSD sed and it doesn't support the `~` in the `0~2s` portion of the `sed` instruction presented by Dipak although the GNU sed does. The `sed` command I provided you does not need to use that paradigm and will not touch the header lines as they have numeric characters in them and the `sed` instruction I've provided can only remove the first 15 contiguous uppercase alpha characters from lines starting with an uppercase alpha character, so there is no need to instruct `sed` to skip lines. – user3439894 May 14 '15 at 19:49
  • Again, consider checking how to format your answer. Use the `{}` button to print code. – fedorqui May 15 '15 at 12:26
0

The following script might help you, it takes two arguments: 1. Original file (from which to make a conversion) 2. File where to save results.

#!/bin/bash
# call this script and pass two arguments:
# ./script FROM_FILE TO_FILE
FROM=$1
TO=$2

i=1;
while IFS=$'\n' read line; do
    ((i++)); 
    # skip 2,4,6, ..., nth lines 
    [ $((i % 2)) -eq 0 ] && (echo -n $line >> $TO; continue);
    echo ${line:15} >> $TO
done < $FROM
haodemon
  • 71
  • 1
  • 6
  • While it does remove the first 15 characters of every other line in the file it also removes the entire every other line starting with the first line too! – user3439894 May 13 '15 at 22:59
  • Now it outputs nothing! Test your code before you post it! – user3439894 May 13 '15 at 23:02
  • You're absolutely right - my fault. Try again, dear. – haodemon May 13 '15 at 23:04
  • 1
    I strongly suggest you reread what colin wants... _"I would like to remove the first 15 characters of every other line in the file. This would remove the string GACTACACGTAGTAT from the second, fourth, sixth, eighth lines (etc)."_ – user3439894 May 13 '15 at 23:05
  • The solution you proposed would only work if colin has file filled with the same strings. He could have very well used a simple text editor to fix that ;) And yet, I find your solution elegant and smart. – haodemon May 13 '15 at 23:18
  • in the FASTA format , it isn't always one line of header and then one line of sequence , usually there are several lines of sequence, and OP needs only delete first 15 letter in several lines of sequence – Jose Ricardo Bustos M. May 14 '15 at 00:56
  • It's nice to know, thanks. Have you tried running the script? ;) It's pretty much does exactly what OP wants. – haodemon May 14 '15 at 08:52
0

you need erase first bases of file fasta and qual for analysis, while i find a solution with QIIME, a solution using python and biopython:

from Bio import SeqIO

file_fasta = open("test.fasta")
file_qual = open("test.qual")

iterator_fasta = SeqIO.parse(file_fasta, "fasta")
iterator_qual = SeqIO.parse(file_qual, "qual")

size_trim = 15

output_fasta = open("trim.fasta","w")
for seq in iterator_fasta:
  if len(seq) <= size_trim:
    raise NameError('len seq less or equal than trim size')
  seq.seq = seq.seq[size_trim:]
  output_fasta.write(seq.format("fasta"))

output_fasta.close()

output_qual = open("trim.qual","w")
for seq_qual in iterator_qual:
  if len(seq_qual.letter_annotations['phred_quality']) <= size_trim:
    raise NameError('len qual less or equal than trim size')
  seq_qual.letter_annotations['phred_quality'] = seq_qual.letter_annotations['phred_quality']
  output_qual.write(seq_qual.format("qual"))

output_qual.close()

you get in trim.fasta

>SRR1502445.1
ACGAGTGCGTTCCTGCGCTTATTGATATGCTTAAGTTCAGCGGGTAGTCTCACCCGATTT
GAGGTCAAGGTTTGTGGGTCGAGTCACAACTCGAACATCGTCTTTGTACAAAGACGGTTG
GAAGCGGGTTCCAAGGCAACACAGGGGGATAGGNNNNNNNNNNNNNNNNNNNNNNN
>SRR1502445.2
ACGAGTGCGTTCCTGCGCTTATTGATATGCTTAAGTTCAGCGGGTAGTCTCACCCGATTT
GAGGTCAAGGTTTGTGGGTCGAGTCACAACTCGAACATCGTCTTTGTACAAGACGGTTGG
AAGCGGGTTCCAAGGCACACAGGGGATAGGNNN
>SRR1502445.3
ACGAGTGCGTTCCTGCGCTTATTGATATGCTTAAGTTCAGCGGGTAGTCTCACCCGATTT
GAGGTCAAGGTTTGTGGGTCGAGTCACAACTCGAACATCGTCTTTGTACAAAGACGGTTG
GAAGCGGGTTCCAAGGCACACAGGGGATAGGNNN
>SRR1502445.4
ACGAGTGCGTTCCTGCGCTTATTGATATGCTTAAGTTCAGCGGGTAGTCTCACCCGATTT
GAGGTCAAGGTTTGTGGGTCGAGTCACAACTCGAACATCGTCTTTGTACAAAGACGGTTG
GAAGCGGGTTCCAAGGCACACAGGGGATAGGNNNNNNNNNNN

EDIT:

using qiime, I recommend to use split_libraries, it does the trim and check the quality .... truncate_fasta_qual_files.py only select first B bases, trim last base doing otherwise to expected.

Jose Ricardo Bustos M.
  • 8,016
  • 6
  • 40
  • 62
0

Use regular expressions and either perl or awk,

perl (write a script, and expand it to detect other regular expressions,

my $pattern=$ARGV[1]||"GACTACACGTAGT";
#provide any gene sequence prefix, and pattern removes that prefix
while (<>) {
    #explicit check for non-gene/header pattern
    if( $_ =~ /^[\>\;]/ ) {
        print $_;
    }
    #check for the specific header pattern provided, for example
    elsif( $_ =~ /^SRR1502445/ ) {
        print $_;
    }
    #check for the gene pattern given
    elsif( $_ =~ /^$pattern(.*)/ ) {
        print "$1\n";
    }
    else {
        print $_;
    }
}

perl -lane,

perl -lane 'if( $_ =~ /^GACTACACGTAGT(.*)/ ) {print "$1\n";} else {print $_; }'

awk,

/SRR1502445/ { print $0; }
/^GACTACACGTAGTAT/ { print substr($0,16); }

Works on any linux/unix box, and cygwin also.


The file format seems to be FASTA, which is described here FASTA Specification

ChuckCottrill
  • 4,360
  • 2
  • 24
  • 42
  • 1
    You should see the first comment colin made on my answer. The first 15 characters will not always be "GACTACACGTAGTAT" so your answer is in the same boat my first was. – user3439894 May 13 '15 at 23:57
  • in the FASTA format , it isn't always one line of header and then one line of sequence , usually there are several lines of sequence, and OP needs only delete first 15 letter in several lines of sequence – Jose Ricardo Bustos M. May 14 '15 at 00:56
  • The OP Asked for a solution which would remove the first 15 characters from every other line - which from your comments suggests incomplete knowledge of the file format. However, the solution I have offered uses regular expressions, and will solve both the problem as stated, and the more general problem, of how to recognize the gene pattern versus the header pattern (at least for the perl script version). – ChuckCottrill May 15 '15 at 01:02
  • @ChuckCottrill, I think maybe you have misunderstood and or have not followed all the relevant comments. The header row can be different lengths and values and DNS sequences too that will not be the same 15 characters as in the sample. Also the perl script you posted has `else if` in it and isn't that supposed to be `elsif`? I ask because the script as written throws errors there. Additionally in the `perl -lane` line you have `{print "$1\n";}` and shouldn't that just be `{print "$1";}`? Otherwise it's inserting a blank like after every DNS sequence and that's not shown that way in the OP. – user3439894 May 15 '15 at 02:27
0

A one-liner alternative to sed is awk.

Given an alternating-lined-element FASTA file called foo.fa, you can strip the first 15 characters of the sequence strings with substr():

$ awk '/^#/ {next} /^>/ { print $0 } /^[^>]/ { print substr($0, 16, length($0) - 15) }' foo.fa > foo.filtered.fa

As awk uses 1-based indexing, the start position argument in substr() is 16.

In addition to offering code to process alternating lines separately, another advantage of awk is that it can sometimes run faster than sed. Yet another advantage is portability, given differences in sed between common bioinformatics platforms.

So if you plan on doing this a lot or on "whole genome"-scale files, you might investigate this approach, as well.

Alex Reynolds
  • 95,983
  • 54
  • 240
  • 345
  • Alex, could you please explain a bit, what your one-liner is doing? – user3439894 May 14 '15 at 21:46
  • The `/^#/ {next}` directive applies two different code blocks on the specified regular expression patterns `^>` and `^[^>]`, which represent header and sequence lines in an alternating-line FASTA file, respectively. The `^>` block simply prints the header line (`$0`), while the `^[^>]` block prints the substring of the sequence line (again, `$0`) with a start parameter of `15` and a length parameter of the line length, minus 14. This effectively strips the first 15 characters, whatever they are. – Alex Reynolds May 14 '15 at 22:03
  • Sorry, I made an off-by-one error. The correct start index is 16, not 15. – Alex Reynolds May 14 '15 at 22:10