2

I have a text file containing 690 entries similar to the example shown in the P.S. (taken from http://www.ncbi.nlm.nih.gov/nuccore/AB753792.1). In my text file, entries are separated by "//".

In my case, "ACCESSION   " (the string plus 3 spaces) is not followed by an upper-case alphanumeric string (such as "AB753792" in the P.S.). I am running Mac OS X Yosemite with the default Bash and would like to fill the 690 empty fields with unique upper-case alphanumeric strings such as those generated by:

openssl rand -hex 4 | tr '[:lower:]' '[:upper:]'    

(5.1.15: I have changed the above command; it was different in the first version of this post.)

I can see how sed / awk could be a solution to this problem, but I can't figure out how sed would be able to insert a unique 8-character upper-case alphanumeric string after each "ACCESSION ".

I would be happy to receive help.

Kind regards,

Paul

P.S.

LOCUS       AB753792                 712 bp    DNA     linear   INV 26-JUN-2013
DEFINITION  Acutuncus antarcticus mitochondrial gene for cytochrome c oxidase
            subunit 1, partial cds.
ACCESSION   AB753792
VERSION     AB753792.1  GI:478246768
KEYWORDS    .
SOURCE      mitochondrion Acutuncus antarcticus
ORGANISM  Acutuncus antarcticus
        Eukaryota; Metazoa; Ecdysozoa; Tardigrada; Eutardigrada; Parachela;
        Hypsibiidae; Acutuncus.
REFERENCE   1
AUTHORS   Kagoshima,H., Imura,S. and Suzuki,A.C.
TITLE     Molecular and morphological analysis of an Antarctic tardigrade,
          Acutuncus antarcticus
JOURNAL   J. Limnol. 72 (s1), 15-23 (2013)
REFERENCE  2  (bases 1 to 712)
AUTHORS   Kagoshima,H. and Suzuki,A.C.
TITLE     Direct Submission
JOURNAL   Submitted (07-OCT-2012) Contact:Hiroshi Kagoshima Transdisciplinary
        Research Integration Center/Nationlal Institute of Genetics; 1111
        Yata, Mishima, Shizuoka 411-8540, Japan
FEATURES             Location/Qualifiers
     source          1..712
                     /organism="Acutuncus antarcticus"
                     /organelle="mitochondrion"
                 /mol_type="genomic DNA"
                 /isolation_source="moss sample (Bryum pseudotriquetrum,
                 Bryum argenteum, and Ceratodon purpureus)"
                 /db_xref="taxon:467037"
                 /country="Antarctica: East antarctica, soya coast,
                 Skarvsnes and Langhovde"
 CDS             <1..712
                 /codon_start=2
                 /transl_table=5
                 /product="cytochrome c oxidase subunit 1"
                 /protein_id="BAN14781.1"
                 /db_xref="GI:478246769"
                 /translation="GQQNHKDIGTLYFIFGVWAATVGTSLSMIIRSELSQPGSLFSDE
                 QLYNVTVTSHAFVMIFFFVMPILIGGFGNWLVPLMISAPDMAFPRMNNLSFWLLPPSF
                 MLITMSSMAEQGAGTGWTVYPPLAHYFAHSGPAVDLTIFSLHVAGASSILGAVNFIST
                 IMNMRAPSISLEQMPLFVWSVLLTAILLLLALPVLAGAITMLLLDRNFNTSFFDPAGG
                 GDPILYQHLFWFFGHPEV"
 ORIGIN      
         1 tggtcaacaa aatcataaag atattggtac actttatttt atttttggag tatgagctgc
       61 tacagtagga acatctctta gtatgattat ccggtcagaa cttagacaac caggatcact
       121 cttctcagat gaacaacttt acaacgttac agtaacaaga catgcatttg tcataatttt
       181 cttttttgta atacccatcc ttattggagg atttggaaat tgactagtac ctttaatgat
       241 ttcagcacca gatatagctt tcccccgaat aaataacctg agattctgac tactaccccc
       301 atcttttata ttaattacta taagaagtat agcagaacaa ggagccggga cagggtgaac
       361 agtttacccc cctttagctc actattttgc acactcagga ccagctgtcg atttaactat
       421 tttttctctg catgtagcag gagcatcgtc gattttagga gccgtaaact tcatttctac
       481 aattatgaat atgcgagctc catcaattag tttagaacaa atgccactat ttgtatgatc
       541 agtactactt acagccattt tacttctact agctctgcca gtattagcag gagccatcac
       601 aatgctttta ttagaccgaa attttaacac atcgtttttt gatcctgctg gtgggggaga
       661 tccaattctc tatcaacatt tattttgatt ttttggtcac cctgaagttt aa
 //    
Paul
  • openssl rand -base64 32 | | tr '[a-z]' '[A-Z]' gives a syntax error. Should it be "openssl rand -base64 32||tr '[a-z]' '[A-Z]'"? If so, it does not generate an 8-digit alphanumeric group. Please advise. – repzero Jan 04 '15 at 12:27
  • Hi all, thank you for your help so far. As mentioned I did not test the commands for string generation in my original post, as I wrote the post where no Bash was available. The original command was openssl rand -hex 4 | tr '[:lower:]' '[:upper:]' – Paul Jan 04 '15 at 22:58
  • I edited my answer to illustrate the solution with the revised command "openssl rand -hex 4 | tr '[:lower:]' '[:upper:]'" – repzero Jan 04 '15 at 23:51

3 Answers

2

You can use gawk for that:

gawk '/^ACCESSION[ \t]*$/{cmd="openssl rand -hex 4 | tr \"[:lower:]\" \"[:upper:]\"";cmd |& getline a;close(cmd);print "ACCESSION   " a;next}{print}' /path/to/input > /path/to/output

It is more readable as a multi-line script:

#!/usr/bin/gawk -f

# If a line with an empty ACCESSION field appears,
# the following block gets executed
/^ACCESSION[ \t]*$/ {
    # Prepare the openssl command (8 upper-case hex characters)
    cmd = "openssl rand -hex 4 | tr '[:lower:]' '[:upper:]'"
    # Execute the openssl command and store the result in random
    cmd |& getline random
    # Close the co-process so the command runs again for the next match
    close(cmd)
    # Print the field name, three spaces, and the new identifier
    printf "ACCESSION   %s\n", random
    # Step forward to the next line of input (don't execute
    # the following block)
    next
}

# Print all other lines unmodified
{ print }

Note that you'll need GNU awk (gawk) for that, since the script uses co-processes (the |& operator), which are available only in GNU's version of awk.
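If gawk is not installed, the same idea can be sketched with an ordinary pipe (cmd | getline), which POSIX awk supports. This is a minimal sketch and not the method from the answer above; the function name fill_accessions and the ID_CMD override variable are hypothetical names introduced only for illustration.

```shell
#!/bin/sh
# Sketch: fill empty ACCESSION lines using a plain `cmd | getline` pipe
# instead of a gawk co-process.
# fill_accessions and ID_CMD are hypothetical names, not part of any tool.
fill_accessions() {
    # Command that prints one identifier per invocation; override via ID_CMD
    gen=${ID_CMD:-"openssl rand -hex 4 | tr '[:lower:]' '[:upper:]'"}
    awk -v cmd="$gen" '
    /^ACCESSION[ \t]*$/ {
        # Run the command and read its first output line into id
        if ((cmd | getline id) > 0)
            print "ACCESSION   " id
        else
            print            # command failed: keep the line unchanged
        close(cmd)           # close so the command is re-run for the next match
        next
    }
    { print }
    ' "$1"
}

# Usage sketch:
#   fill_accessions /path/to/input > /path/to/output
```

Closing the pipe after every match matters: without close(cmd), awk would keep reading from the same, already finished command and no new identifier would be produced.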

hek2mgl
  • you could use sed -i.bak.. so it would create files.bak before running sed inplace. – SMA Jan 04 '15 at 11:23
  • He has one text file with 600 entries, not 600 files. – clt60 Jan 04 '15 at 11:26
  • This makes things easier. Added the `g` option – hek2mgl Jan 04 '15 at 11:27
  • @hek2mgl Isn't there one pipe too many in your first command? Also, I don't know what you mean by *the default MAC version of sed doesn't understand the option -i*. BSD `sed` has the `-i` flag, but its argument is mandatory. – jub0bs Jan 04 '15 at 11:41
  • Oh, I just copied the openssl command from the question. Thx for the hint. About the `-i` option of `sed` on Mac, it seems I didn't have that perfectly in mind (never used a Mac)... Will update my answer. – hek2mgl Jan 04 '15 at 12:35
  • @hek2mgl Cool. Another thing is that `'s/ACCESSION[ \t]*/ACCESSION '"$NUMBER"'/g'` would probably be better. The OP mentions 3 spaces after "ACCESSION", not a tab. – jub0bs Jan 04 '15 at 13:00
  • Yep, that seems more stable, added that. Thanks for your help! :) – hek2mgl Jan 04 '15 at 13:10
  • @hek2mgl Thank you for your code solution. I have tried this and it works very well, but unfortunately the string is generated before sed puts it in place so that all ACCESSION values have the same value. However I would like to have them unique. Would you have any ideas on how to advance this? I am running `# 2) this solution produces the same value after each "ACCESSION", not unique values NUMBER=$(openssl rand -hex 4 | tr '[:lower:]' '[:upper:]') sed 's/ACCESSION[ \t]*$/ACCESSION '"$NUMBER"'/g' /Users/paul/Documents/140911_c3_analysis/ref_db_COI/150103_AVC_nem_rot_tar_WITH_TAXONOMY.gb` – Paul Jan 05 '15 at 00:45
1

You can try the following script; pass your file as its argument:

#!/bin/bash
for i in {1..7}; do 
    var=$(openssl rand -hex 4 | tr '[:lower:]' '[:upper:]');
    sed  -i.bak '/^ACCESSION   $/{s#ACCESSION   #&'"${var}"'#g;:tag;n;b tag}' "$1"
done

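Note that I use {1..7} to loop seven times because my test file has 7 lines consisting of ACCESSION followed by exactly three spaces and the end of the line. Each pass fills in the first remaining empty line, so the loop has to run once per empty ACCESSION line in your file.

For example, with this input: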

ACCESSION   
VERSION
ACCESSION   
VERSION
ACCESSION   
VERSION    
ACCESSION   
VERSION    
ACCESSION   
VERSION    
ACCESSION   
VERSION    
ACCESSION   

the output is:

ACCESSION   E4197EB1
VERSION
ACCESSION   EFA0CEFF
VERSION
ACCESSION   9499CA54
VERSION    
ACCESSION   2AD2690D
VERSION    
ACCESSION   3598659F
VERSION    
ACCESSION   25608153
VERSION    
ACCESSION   1B43896B

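EDIT: Since you are using Mac OS X, you can try this alternative, which puts each sed command on its own line (BSD sed reads a label up to the end of the line):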

#!/bin/bash
for i in {1..7}; do 
    var=$(openssl rand -hex 4 | tr '[:lower:]' '[:upper:]');
    sed  -i.bak '
    /^ACCESSION   $/{
    s#ACCESSION   #&'"${var}"'#g
    :tag
    n
    b tag
    }' "$1"
done
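The loop bound does not have to be hard-coded. As a minimal sketch, the number of empty ACCESSION lines can be counted first and used as the bound; count_empty_accessions is a hypothetical helper name, and the pattern assumes exactly three spaces after ACCESSION, as in the script above.

```shell
#!/bin/bash
# Sketch: count the lines that consist of ACCESSION plus exactly three spaces.
# count_empty_accessions is a hypothetical helper name.
count_empty_accessions() {
    # grep -c prints the number of matching lines; it exits with status 1
    # when the count is 0, so "|| true" keeps that case from looking like a failure
    grep -c '^ACCESSION   $' "$1" || true
}

# Usage sketch: brace expansion such as {1..$n} does not accept variables,
# so use an arithmetic for loop with the counted bound instead:
#   n=$(count_empty_accessions "$1")
#   for ((i = 0; i < n; i++)); do
#       ...   # one sed pass per empty ACCESSION line
#   done
```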
repzero
  • Thank you for your help. Your code looks very promising but gives me an error message. I have checked again: The format is "ACCESSION" followed by three spaces and a newline. Any ideas why this isn't working for me? I ran: `for i in {1..621}; do var=$(openssl rand -hex 4 | tr '[:lower:]' '[:upper:]'); sed '/^ACCESSION $/{s#ACCESSION #&'"${var}"'#g;:tag;n;b tag}' /Users/paul/Documents/140911_c3_analysis/ref_db_COI/150103_AVC_nem_rot_tar_WITH_TAXONOMY.gb done ` The error message is `sed: 1: "/^ACCESSION $/{s#ACCE ...": unexpected EOF (pending }'s)` – Paul Jan 05 '15 at 00:37
  • Try putting the exact code in a file (a script), then make it executable with "chmod a+x". Open a terminal and drag the script into the terminal, followed by a space, followed by your file. – repzero Jan 05 '15 at 01:33
  • Thanks again for your help. I did what you suggested but the error message remains sed: 1: "/^ACCESSION $/{s#ACCE ...": unexpected EOF (pending }'s) Potentially there is something wrong with the recognition of the input file? Cheers and thanks again. – Paul Jan 05 '15 at 06:43
  • I see you are using Mac OS X. I edited my answer so that it runs on OS X; see the edit below the old answer. The reason you are getting the error could be the issue described here: http://stackoverflow.com/questions/15467616/sed-gives-me-unexpected-eof-pending-s-error-and-i-have-no-idea-why – repzero Jan 05 '15 at 11:24
0

Thank you very much for your help. I used @hek2mgl's solution, as I could not get the sed commands going.

Thanks for providing comments in the example code. I modified it as follows:

#!/usr/local/bin/gawk -f
# For every line that starts with ACCESSION,
# the following block gets executed
/^ACCESSION/ {
    # Prepare the openssl command
    cmd = "openssl rand -hex 4 | tr '[:lower:]' '[:upper:]'"
    # Execute the openssl command and store the result in random
    cmd |& getline random
    close(cmd)
    # Replace the whole line with the field name and the new identifier
    printf "ACCESSION   %s\n", random
    # Step forward to the next line of input (don't execute
    # the following block)
    next
}

# Print all other lines unmodified
{ print }
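Random identifiers are very likely, but not guaranteed, to be unique: assuming 690 independent draws of 4 random bytes, the chance of at least one collision is roughly 1 in 18,000. As a hedged sketch, the result can be checked after the fact; duplicate_accessions is a hypothetical helper name, and output.gb stands for whatever file the script above wrote.

```shell
#!/bin/sh
# Sketch: list ACCESSION values that occur more than once in a file.
# duplicate_accessions is a hypothetical helper name.
duplicate_accessions() {
    # Print the second field of every ACCESSION line that has a value,
    # then keep only the values that appear at least twice
    awk '$1 == "ACCESSION" && NF >= 2 { print $2 }' "$1" | sort | uniq -d
}

# Usage sketch (output.gb is a placeholder file name):
#   if [ -z "$(duplicate_accessions output.gb)" ]; then
#       echo "all ACCESSION values are unique"
#   fi
```

If the function prints anything, rerunning the fill step on the original input is the simplest way to get a fresh set of identifiers.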
Paul