I'm relatively new to scripting and apologize in advance for this painfully simple problem. I believe I've searched pretty thoroughly, but apparently no other answers or cookbooks have been explicit enough for me to understand (like here - still couldn't get it).

I have a file that is made up of strings of letters (DNA, if you care), one string per line. Above each string I've inserted another line to identify the underlying string. For those of you who are bioinformaticians, I'm trying to make up a test data set in fasta format, maybe you have tools? Anyway, I'd put a distinct word, "num", after each ">" with the intention of using a bash incrementer and sed to create a unique number heading each string. For example, in data.txt, I have...

>num, blah, blah, blah

ATCGACTGAATCGA

>num, blah, blah, blah

ATCGATCGATCGATCG

>num, blah, blah, blah

ATCGATCGATCGATCG

I would like it to be...

>0, blah, blah, blah

ATCGACTGAATCGA

>1, blah, blah, blah

ATCGATCGATCGATCG

>2, blah, blah, blah

ATCGATCGATCGATCG

The solution can be in any language as long as it's complete && gets the job done. I have a little experience with sed, awk, bash, and c++ (little == slightly more than no experience). I know, I know, I need to learn perl, but I've only just started. The question is this: How to replace "num" with a number that increments on each replacement? It doesn't matter if the underlying string is identical to another somewhere else. Thanks for your help in advance!

vincent
  • Totally (pseudo) off-topic, but please check out [Haskell](http://www.haskell.org/haskellwiki/Applications_and_libraries). – Jared Farrish Jun 11 '11 at 00:06
  • For instance, [Genetic programming](http://www.haskell.org/haskellwiki/Applications_and_libraries/Genetic_programming). – Jared Farrish Jun 11 '11 at 00:07
  • Sed is not the tool to use here. You can't combine sed and bash in the way you want. It would be easier to write an editor macro in Emacs or Vim than to do it in sed+bash. (Awk would work, though.) As I said, even a real editor would work. You need to explore some tools and start learning them. Almost anything will be better than nothing. -- You probably wrote out all the "num" lines by hand, too, right? – yam655 Jun 11 '11 at 00:31
  • @Jared, thanks I'll look into Haskell. @yam655, Good to know that I can't use sed and bash this way. I just started using Vim last week and so far I like it a lot better than pico or nano :). Give me a little credit here, I used awk to insert the 40,000 lines, I guess you can call that "by hand". – vincent Jun 11 '11 at 03:06
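
For reference, the awk route mentioned in the comments above could look roughly like the following. This is only a minimal sketch: the function name `number_headers` and the sample input are made up for illustration, and it assumes every header line starts with exactly `>num`, as in the question.

```shell
# number_headers is a hypothetical helper name, not part of any tool.
# It reads the named files (or standard input) and, on each line that
# starts with ">num", replaces that prefix with ">" plus a counter.
number_headers() {
    awk 'BEGIN { n = 0 }
         /^>num/ { sub(/^>num/, ">" (n++)) }   # counter starts at 0, then increments
         { print }' "$@"
}

# Made-up sample input mirroring data.txt from the question.
printf '%s\n' '>num, blah, blah, blah' 'ATCGACTGAATCGA' \
              '>num, blah, blah, blah' 'ATCGATCGATCGATCG' | number_headers
# prints:
# >0, blah, blah, blah
# ATCGACTGAATCGA
# >1, blah, blah, blah
# ATCGATCGATCGATCG
```

On a real file this would be run as `number_headers data.txt > numbered.txt`. Anchoring the pattern at the start of the line keeps a stray "num" elsewhere in a header from being touched.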

2 Answers

perl -ple 's/num/$n++/e' filename

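Do a dry run first to check that it does what you want. As written, the command prints the result to standard output and leaves the file itself untouched.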

clt60
This uses process substitution, which is available in bash, zsh, and ksh but not in a plain POSIX sh, so it may or may not work on your system.

jcomeau@intrepid:/tmp$ exec 3< <(cat test.txt)
jcomeau@intrepid:/tmp$ i=0
jcomeau@intrepid:/tmp$ while read -u 3 first_word the_rest; do
 if [ "$first_word" == ">num," ]; then
 echo ">$i," $the_rest; i=$((i + 1)); else
 echo $first_word $the_rest; fi; done
>0, blah, blah, blah

ATCGACTGAATCGA

>1, blah, blah, blah

ATCGATCGATCGATCG

>2, blah, blah, blah

ATCGATCGATCGATCG
jcomeau_ictx
  • This worked perfectly, too, and thanks for your answer! It seemed like it ran a tiny bit slower than the perl line above, but I find this a very interesting way to get the job done! I didn't know (but guess I should have) that you could do so much from the command line. Does it maintain the i=0 only until you run the following command? I'd vote you up if I could, but I don't have enough reputation yet (this was my first post). Thanks again! – vincent Jun 11 '11 at 03:25
  • i starts off as 0, but it increments with i=$((i + 1)), and whatever it is after the loop, it remains, until set to another value or unset. – jcomeau_ictx Jun 11 '11 at 03:29
  • Cool! This stuff is great. On a side note: your life seems cool. Keep it up and good luck! – vincent Jun 11 '11 at 03:34
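
To illustrate the point about `i` keeping its value after the loop, here is a minimal sketch that should work in bash or any POSIX shell. It feeds the loop through a plain redirect instead of process substitution; the temporary sample file is made up for illustration and stands in for test.txt. Because the loop runs in the current shell, `i` still holds its final count afterwards. Piping into the loop instead (`cat file | while ...`) would typically run it in a subshell in bash, and the count would be lost.

```shell
# Hypothetical sample file standing in for test.txt.
sample=$(mktemp)
printf '%s\n' '>num, blah' 'ATCG' '>num, blah' 'GGCC' > "$sample"

i=0
while IFS= read -r line; do
    case $line in
        '>num,'*)
            # Drop the leading ">num" and print ">" plus the counter instead.
            printf '>%s%s\n' "$i" "${line#?num}"
            i=$((i + 1))
            ;;
        *)
            printf '%s\n' "$line"
            ;;
    esac
done < "$sample"

# The loop ran in the current shell, so i keeps its final value.
echo "headers numbered: $i"
rm -f "$sample"
```

Reading whole lines with `IFS= read -r` also leaves the sequence lines exactly as they were, rather than splitting and rejoining them on whitespace.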