I have code like this:

#!/usr/bin/perl
use strict;
use warnings;      
my %proteins = qw/
    UUU F UUC F UUA L UUG L UCU S UCC S UCA S UCG S UAU Y UAC Y UGU C UGC C UGG W
    CUU L CUC L CUA L CUG L CCU P CCC P CCA P CCG P CAU H CAC H CAA Q CAG Q CGU R CGC R CGA R CGG R
    AUU I AUC I AUA I AUG M ACU T ACC T ACA T ACG T AAU N AAC N AAA K AAG K AGU S AGC S AGA R AGG R
    GUU V GUC V GUA V GUG V GCU A GCC A GCA A GCG A GAU D GAC D GAA E GAG E GGU G GGC G GGA G GGG G
    /;
open(INPUT,"<dna.txt");
while (<INPUT>) {    
    tr/[a,c,g,t]/[A,C,G,T]/;
    y/GCTA/CGAU/;    
    foreach my $protein (/(...)/g) {
        if (defined $proteins{$protein}) {
            print $proteins{$protein};
        }
    }
}
close(INPUT);

This code is related to my other question's answer: DNA to RNA and Getting Proteins with Perl

The output of the program is:

SIMQNISGREAT

How can I rewrite this code in Perl so that it runs from the command line and uses less code (a one-liner, if possible)?

PS 1: dna.txt looks like this:

TCATAATACGTTTTGTATTCGCCAGCGCTTCGGTGT

PS 2: If it makes the code shorter, it is acceptable to move the %proteins table into a file.

kamaci
  • Don't. Readability is good. – geekosaur Mar 23 '11 at 08:11
  • While Perl certainly has many interesting one-liners, they aren't always _better_ than longer, more verbose code. Is there something _specific_ you want to improve? – sarnold Mar 23 '11 at 08:13
  • This isn't about homework; the code is already good enough for that. I just want to learn how to improve it with fewer lines, because I'm interested in one-liner coding. – kamaci Mar 23 '11 at 08:40
  • It's also worth noting that the brackets and commas in tr/[a,c,g,t]/[A,C,G,T]/ are unnecessary: tr takes plain character lists rather than regex character classes, so they only map brackets and commas to themselves. tr/acgt/ACGT/ is enough. – Cooper Mar 23 '11 at 17:17

5 Answers

The only changes I would recommend making are simplifying your while loop:

while (<INPUT>) {
    tr/acgt/ACGT/;
    tr/GCTA/CGAU/;
    foreach my $protein (/(...)/g) {
        if (defined $proteins{$protein}) {
            print $proteins{$protein};
        }
    }
}

Since y and tr are synonyms, you should only use one of them. I think tr reads better than y, so I picked tr. Further, you were calling them very differently; this version should have the same effect and mentions only the letters you actually change. (All the other characters were being mapped to themselves, which makes it much harder to see what is actually being changed.)

You might want to remove the open(INPUT,"<dna.txt"); and corresponding close(INPUT); lines (reading from <> instead of <INPUT>), as they make it much harder to use your program in shell pipelines or with different input files. But that's up to you: if the input file will always be dna.txt, this is fine.
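A minimal sketch of that suggestion, assuming the translation is moved into a helper that reads from whatever filehandle it is given. The name translate_stream is hypothetical and not part of the original program; the table is the one from the question.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Codon table copied from the question.
my %proteins = qw/
    UUU F UUC F UUA L UUG L UCU S UCC S UCA S UCG S UAU Y UAC Y UGU C UGC C UGG W
    CUU L CUC L CUA L CUG L CCU P CCC P CCA P CCG P CAU H CAC H CAA Q CAG Q CGU R CGC R CGA R CGG R
    AUU I AUC I AUA I AUG M ACU T ACC T ACA T ACG T AAU N AAC N AAA K AAG K AGU S AGC S AGA R AGG R
    GUU V GUC V GUA V GUG V GCU A GCC A GCA A GCG A GAU D GAC D GAA E GAG E GGU G GGC G GGA G GGG G
/;

# translate_stream (hypothetical helper) reads DNA lines from any
# filehandle, so the caller decides where the input comes from.
sub translate_stream {
    my ($fh) = @_;
    my $out = '';
    while (my $line = <$fh>) {
        $line =~ tr/acgt/ACGT/;    # normalise case
        $line =~ tr/GCTA/CGAU/;    # complement DNA into RNA
        for my $codon ($line =~ /(...)/g) {
            $out .= $proteins{$codon} if defined $proteins{$codon};
        }
    }
    return $out;
}

# Demo on an in-memory filehandle holding the sample from the question.
my $sample = "TCATAATACGTTTTGTATTCGCCAGCGCTTCGGTGT\n";
open my $demo, '<', \$sample or die "in-memory open failed: $!";
print translate_stream($demo), "\n";    # prints SIMQNISGREAT
```

In a real script the last three lines could become print translate_stream(\*STDIN), "\n"; so that the program can sit in a shell pipeline, or the loop could read from <>, which takes file names from the command line and falls back to standard input.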

sarnold

#!/usr/bin/perl
%p=qw/UUU F UUC F UUA L UUG L UCU S UCC S UCA S UCG S UAU Y UAC Y UGU C UGC C UGG W
CUU L CUC L CUA L CUG L CCU P CCC P CCA P CCG P CAU H CAC H CAA Q CAG Q CGU R CGC R CGA R CGG R
AUU I AUC I AUA I AUG M ACU T ACC T ACA T ACG T AAU N AAC N AAA K AAG K AGU S AGC S AGA R AGG R
GUU V GUC V GUA V GUG V GCU A GCC A GCA A GCG A GAU D GAC D GAA E GAG E GGU G GGC G GGA G GGG G/;
$_=uc<DATA>;y/GCTA/CGAU/;map{print if$_=$p{$_}}/(...)/g
__DATA__
TCATAATACGTTTTGTATTCGCCAGCGCTTCGGTGT

Phew. Best I can come up with, at least this quickly. If you're sure the input is always already in uppercase, you can also drop the uc saving another two characters. Or if the input is always the same, you could assign it to $_ straight away instead of reading it from anywhere.

I guess I don't need to say that this code should not be used in production environments or anywhere else other than pure fun. When doing actual programming, readability almost always wins over compactness.
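For anyone untangling the golfed line, here is a rough ungolfed equivalent. The sub name ungolfed and the shortened table are illustrative assumptions, not part of the answer; each comment names the golfed fragment it stands for.

```perl
use strict;
use warnings;

# Shortened table for illustration only; the answer uses all 61 entries.
my %p = qw/AGU S AUU I AUG M CAA Q/;

sub ungolfed {
    my ($dna) = @_;
    my $rna = uc $dna;                    # $_=uc<DATA>
    $rna =~ tr/GCTA/CGAU/;                # y/GCTA/CGAU/
    my $out = '';
    for my $codon ($rna =~ /(...)/g) {    # /(...)/g in list context
        my $aa = $p{$codon};              # $_=$p{$_}
        $out .= $aa if $aa;               # print if ... (a truth test, not defined)
    }
    return $out;
}

print ungolfed("tcaTAATACGTT"), "\n";     # prints SIMQ
```

The golfed version prints as it goes instead of building a string; collecting the result here just makes it easier to check.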

A few other versions I mentioned in the comments:

Reading %p and the DNA from files:

#!/usr/bin/perl
open A,"<p.txt";map{map{/(...)/;$p{$1}=chop}/(... .)/g}<A>;
open B,"<dna.txt";$_=uc<B>;y/GCTA/CGAU/;map{print if$_=$p{$_}}/(...)/g

From shell with perl -e:

perl -e 'open A,"<p.txt";map{map{/(...)/;$p{$1}=chop}/(... .)/g}<A>;open B,"<dna.txt";$_=uc<B>;y/GCTA/CGAU/;map{print if$_=$p{$_}}/(...)/g'
jho
  • How about getting %p variable from a file and how I will define DATA variable? – kamaci Mar 23 '11 at 08:42
  • @kamaci: DATA is a special file handle that points to the beginning of a data section delimited by `__DATA__`. Everything after `__DATA__` is treated as a comment, and you can read from it with `<DATA>`. But if you wanted to read everything from files, here's another try: http://pastebin.com/RHNdZbU5 – jho Mar 23 '11 at 09:12
  • Your solution looks perfect; one more thing for me to learn. I asked another question, http://stackoverflow.com/questions/5317461/how-to-determine-number-of-times-a-word-appears-in-text, and one of the answers was: perl -0777ne "print+(@@=/count/g)+0" terrible.pl Can this code be changed to run like that? – kamaci Mar 23 '11 at 09:18
  • @kamaci: I have to admit I haven't done much perl golfing in the command line, and to be fair I think this is a bit too long a script to be run directly like that (at least until someone chimes in and tells how to efficiently handle multiple files ;) ). Here goes anyway: `perl -e 'open A,"<p.txt";map{map{/(...)/;$p{$1}=chop}/(... .)/g}<A>;open B,"<dna.txt";$_=uc<B>;y/GCTA/CGAU/;map{print if$_=$p{$_}}/(...)/g'` – jho Mar 23 '11 at 09:31
  • When I run that code from command line it says something like file not found when I am at the directory of them? – kamaci Mar 23 '11 at 20:34

Somebody (@kamaci) called my name in another thread. This is the best I can come up with while keeping the protein table on the command line:

perl -nE'say+map+substr("FYVDINLHL%VEMKLQL%VEIKLQFYVDINLHCSGASTRPWSGARTRP%SGARTRPCSGASTR",(s/GGG/GGC/i,vec($_,0,32)&101058048)%63,1),/.../g' dna.txt

(This uses Unix shell quoting; on Windows, swap the ' and " characters.) This version marks invalid codons with %; you can probably fix that by adding =~y/%//d at an appropriate spot.

Hint: This picks out 6 bits from the raw ASCII encoding of an RNA triple, giving 64 codes between 0 and 101058048; to get a string index, I reduce the result modulo 63, but this creates one double mapping which regrettably had to code two different proteins. The s/GGG/GGC/i maps one of them to another that codes the right protein.

Also note the parentheses before the % operator which both isolate the , operator from the argument list of substr and fix the precedence of & vs %. If you ever use that in production code, you're a bad, bad person.
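To make the hint concrete, here is a sketch that spells out the index computation. The names codon_index and decode are hypothetical; the lookup string and the constants are copied from the one-liner. The only deliberate difference is that the codon is padded to four bytes before calling vec, so the 32-bit read is explicit rather than relying on how vec treats a three-byte string.

```perl
use strict;
use warnings;

# Lookup string copied from the one-liner; '%' marks codons with no protein.
my $table = 'FYVDINLHL%VEMKLQL%VEIKLQFYVDINLHCSGASTRPWSGARTRP%SGARTRPCSGASTR';

# codon_index is a hypothetical name for the one-liner's index expression.
sub codon_index {
    my ($codon) = @_;
    $codon =~ s/GGG/GGC/i;    # resolve the one collision (see note below)
    # Pad to 4 bytes so vec reads a full 32-bit big-endian word.
    my $word = vec($codon . "\0", 0, 32);
    # 101058048 == 0x06060600: bits 1-2 of each of the first three bytes,
    # giving A => 0, C => 2, T or U => 4, G => 6. The case bit 0x20 is
    # outside the mask, so lowercase input maps to the same values.
    return ($word & 0x06060600) % 63;
}

sub decode {
    my ($dna) = @_;
    return join '', map { substr($table, codon_index($_), 1) } $dna =~ /.../g;
}

printf "%s => %d\n", $_, codon_index($_) for qw/TCA AAA GGC/;
print decode('TCATAATACGTTTTGTATTCGCCAGCGCTTCGGTGT'), "\n";    # prints SIMQNISGREAT
```

Working the arithmetic through, the masked values for AAA (0) and GGG (101058048, which is a multiple of 63) both reduce to index 0, which is the double mapping the substitution avoids. Note that the table is indexed by the raw DNA triple, so no complement step is needed first.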

LHMathies
  • How about puting proteins into proteins.txt Does it make the code more less? – kamaci Mar 24 '11 at 17:43
  • Unbelievable. I think it works for not upper case characters too. – kamaci Mar 24 '11 at 17:50
  • That's another question -- but yes, you can save a bit by putting the 63-character string into proteins.txt (without the quotes); and then it turns out that it's very easy to turn it into an array instead, saving a lot by replacing the `substr` call. Because I'm using the `-n` flag to read the files, a bit of extra logic is needed to get the data stashed away, but it's an overall win: `perl -nE'/F/?@@=/./g:say+map+@@[(s/GGG/GGC/i,vec($_,0,32)&101058048)%63],/.../g' proteins.txt dna.txt` – LHMathies Mar 24 '11 at 18:00
  • @LHMathies I am new to Perl and just want to learn that what can be done with Perl and what is the power of it. Also I wonder about the one-liner coding. This was a homework and solved it but I am not looking for an answer for homework just I want to learn how it can be improved and what can I learn more, thanks for your helps cos of your answers I decided to learn Perl and one-liner coding. – kamaci Mar 24 '11 at 18:03
  • And yes, it does case folding by ignoring the bit (0x20) that differs between upper and lower case. On the other hand, it will probably fail laughably on EBCDIC Perl. – LHMathies Mar 24 '11 at 18:03
  • Wonderful. Both voting up, accepting as answer, voting up comments. Thanks. – kamaci Mar 24 '11 at 18:05
  • Another question just to learn. What does "invalid codons" mechanism does and I couldn't understand how we changed the UUU F UUC F UUA L... array into a FYVDINLHL%VEMKLQL%... and the purpose of % character within it. Sorry for asking them but just I want to learn. – kamaci Mar 24 '11 at 18:08
  • @kamaci, I don't think there's a great career in one-liner coding, but it does hone your awareness of edge cases in the language that can otherwise bite you because you happen to invoke them. And it's fun. Perl is one of the worst/most interesting languages in that regard. (My own production code runs clean under `strict` and `warnings` and checks every system call -- for a start). – LHMathies Mar 24 '11 at 18:08
  • @kamaci: I changed the `UUU F ...` table into the string by running another Perl program to find out which number between 0 and 62 each three-letter DNA codon becomes under the `(s/GGG/GGC/i,vec($_,0,32)&101058048)%63` operation, and then putting the right character at each position in the string. The `%` characters are there to take up the positions that don't code a protein, otherwise the string indexing would go wrong. – LHMathies Mar 24 '11 at 18:13
  • @LHMathies Thanks again and again. Your comments improve my view to Perl and programming languages. I will want just one more help. Can you check that topic: http://stackoverflow.com/questions/5478604/implementing-dot-point-algorithm-with-less-line-of-code-at-perl Thanks for your kindness and helps again. – kamaci Mar 29 '11 at 20:56

Most things have already been pointed out, especially that readability matters. I wouldn't try to reduce the program more than what follows.

use strict;
use warnings;
# http://stackoverflow.com/questions/5402405/
my $fnprot = shift || 'proteins.txt';
my $fndna  = shift || 'dna.txt';
# build protein table
open my $fhprot, '<', $fnprot or die "open $fnprot: $!";
my %proteins = split /\s+/, do { local $/; <$fhprot> };
close $fhprot;
# process dna data
my @result;
open my $fhdna, '<', $fndna or die "open $fndna: $!";
while (<$fhdna>) {
    tr/acgt/ACGT/;
    tr/GCTA/CGAU/;
    push @result, map $proteins{$_}, grep defined $proteins{$_}, m/(...)/g;
}
close $fhdna;
# check correctness of result (given input as per original post)
my $expected = 'SIMQNISGREAT';
my $got = join '', @result;
die "@result is not expected" if $got ne $expected;
print "@result - $got\n";

The only "one-liner" thing I added is the push map grep m//g in the while loop. Note that Perl 5.10 adds the "defined or" operator - // - which allows you to write:

push @result, map $proteins{$_} // (), m/(...)/g;

Ah okay, the open do local $/ file slurp idiom is handy for slurping small files into memory. Hope you find it a bit inspiring. :-)
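A small self-contained sketch of the two idioms mentioned here, the slurp and the defined-or filter. It reads from an in-memory string standing in for proteins.txt (an assumption made so the example runs without any files), and the table is cut down to four entries.

```perl
use strict;
use warnings;

# Stand-in for proteins.txt: an in-memory "file" with a few entries.
my $fake_file = "AGU S AUU I\nAUG M CAA Q\n";
open my $fh, '<', \$fake_file or die "open: $!";

# Slurp: with $/ undefined, <$fh> returns the whole file as one string,
# and split turns the whitespace-separated words into key/value pairs.
my %proteins = split /\s+/, do { local $/; <$fh> };
close $fh;

# Defined-or (Perl 5.10+): an unknown codon yields the empty list,
# so it simply disappears from the result.
my @result = map $proteins{$_} // (), qw/AGU XXX AUG/;
print "@result\n";    # prints "S M"
```

The local $/ is confined to the do block, so line-by-line reading elsewhere in the program is unaffected.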

Lumi

If you write the protein data to a separate file, space-delimited and on a single line, you can load the whole table with a single read.

#!/usr/bin/perl
use strict;
use warnings;      

open(INPUT, "<mydata.txt");
open(DATA, "<proteins.txt");
my %proteins = split(" ",<DATA>);

while (<INPUT>) {
    tr/GCTA/CGAU/;
    while(/(\w{3})/gi) {print $proteins{$1} if (exists($proteins{$1}))};
}
close(INPUT);
close(DATA);

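The line tr/a,c,g,t/A,C,G,T/ is dropped here, and the match uses the case-insensitive i flag instead. Be aware that /i only affects matching: the captured text keeps its case, and both tr/GCTA/CGAU/ and the hash lookup are case-sensitive, so this relies on the input already being uppercase. The original foreach loop can be replaced by the while loop above, where $1 holds the text captured by the parentheses in /(\w{3})/gi.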

Dai Nguyen-Van