2

I have a text file containing strings that I want to grep from another file, together with the content that follows each matching line. I am using

grep -A -f file1.txt file2.txt > output.txt

But it does not give any result. Where am I making a mistake?

input file1

536911
536912
536920

input file 2

>gi|536911|CP006573.1|:c959-690 Mannheimia haemolytica D171, complete genome
ATGAAATGCGAACGTTTAGAAGAGTTATTAGAGTTACTTGGCGAACATTGGCGTAAAAATCCTGACTTAC
ACCTCATTGATATTTTGCAGCAGCTTTCAGTTGAAGTGGGCGAGCCTGATAATTTCAAAGCGTTAAGCGA
TGAAGTGTTAATCTATCAGCTTAAAATGCGAAATGCAGGCAAATTTGAGCCTATTCCCGGCATAAAAAAA
GATTATGAAGATGATTTTAAAACGGCTTTATTGCGAGCTCGTGGAATTTTAAACGATTAA
>gi|536912|gb|CP006573.1|:c6390-2194 Mannheimia haemolytica D171, complete genome
ATGAAGACCAAAACATTTACTCGTTCTTATCTTGCTTCTTTTGTAACAATCGTATTAAGTTTACCTGCTG
TAGCATCTGTTGTACGTAATGATGTGGACTATCAATACTTCCGCGATTTTGCCGAAAATAAAGGACCATT
TTCAGTTGGTTCAATGAATATTGATATTAAAGACAACAATGGACAACTTGTAGGCACGATGCTTCATAAT
TTACCAATGGTTGATTTTAGTGCTATGGTAAGAGGTGGATATTCTACTTTAATTGCACCACAATATTTAG
TTAGTGTTGCACATAATACTGGATATAAAAATGTTCAATTTGGTGCTGCAGGTTATAACCCTGATTCACA
TCACTATACTTATAAAATTGTTGACCGCAATGATTATGAAAAGGTTCAAGGAGGGTTGCACCCAGACTAT
>gi|536913|gb|CP006573.1|:7500-8540 Mannheimia haemolytica D171, complete genome
ATGTTTTATTCTAACAACCCTCTCATTAAACACAAGACCGGTTTATTAAATTTAGCAGAAGAACTGGGTA
ATATTTCTCAAGCCTGCAAAGTAATGGGAATGAGCCGAGATACATTCTATCGTTATCAACAAGCGGTTGA
GCAAGGTGGTGTTGAAGCATTGCTGAATCAAAATAGACGCGTTCCCAACTTAAAAAATCGTGTTGATGAG

required output

>gi|536911|CP006573.1|:c959-690 Mannheimia haemolytica D171, complete genome
ATGAAATGCGAACGTTTAGAAGAGTTATTAGAGTTACTTGGCGAACATTGGCGTAAAAATCCTGACTTAC
ACCTCATTGATATTTTGCAGCAGCTTTCAGTTGAAGTGGGCGAGCCTGATAATTTCAAAGCGTTAAGCGA
TGAAGTGTTAATCTATCAGCTTAAAATGCGAAATGCAGGCAAATTTGAGCCTATTCCCGGCATAAAAAAA
GATTATGAAGATGATTTTAAAACGGCTTTATTGCGAGCTCGTGGAATTTTAAACGATTAA
>gi|536912|gb|CP006573.1|:c6390-2194 Mannheimia haemolytica D171, complete genome
ATGAAGACCAAAACATTTACTCGTTCTTATCTTGCTTCTTTTGTAACAATCGTATTAAGTTTACCTGCTG
TAGCATCTGTTGTACGTAATGATGTGGACTATCAATACTTCCGCGATTTTGCCGAAAATAAAGGACCATT
TTCAGTTGGTTCAATGAATATTGATATTAAAGACAACAATGGACAACTTGTAGGCACGATGCTTCATAAT
TTACCAATGGTTGATTTTAGTGCTATGGTAAGAGGTGGATATTCTACTTTAATTGCACCACAATATTTAG
TTAGTGTTGCACATAATACTGGATATAAAAATGTTCAATTTGGTGCTGCAGGTTATAACCCTGATTCACA
TCACTATACTTATAAAATTGTTGACCGCAATGATTATGAAAAGGTTCAAGGAGGGTTGCACCCAGACTAT

How can I achieve this using grep or sed?

Thanks in advance

nu11p01n73R
  • What result are you getting? – Etan Reisner Nov 23 '14 at 17:45
  • You are using `-A` without its required argument. See [man grep](http://linux.die.net/man/1/grep). – whoan Nov 23 '14 at 17:50
  • I am not getting anything; it fails with an argument error. –  Nov 23 '14 at 17:51
  • Understood. Since I do not know the number of lines to print after each string, I left that out. Is there any other way to get the answer? –  Nov 23 '14 at 17:54
  • Looks like your input is a FASTA file. Have you looked at FASTA-related tools, and/or [tag:fasta] questions on StackOverflow? – tripleee Nov 23 '14 at 18:22
  • Yes, my input file is FASTA, but since my task is about extracting data with shell commands, I had not checked the FASTA questions. I'll check them now. –  Nov 23 '14 at 20:20

2 Answers

1

Since you do not know how many lines follow each pattern, the `-A` option won't help you.
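To see why a fixed count falls short, here is a minimal sketch. The `demo_dir` directory, the file names, the shortened sequences and the count of 2 are all made up for illustration. `-A` expects a number (`-A NUM`), and any single number only fits records of exactly that length:

```shell
# Hypothetical demo data: two records with 2 and 3 sequence lines.
demo_dir=$(mktemp -d)
printf '%s\n' 536911 536912 > "$demo_dir/ids.txt"
printf '%s\n' \
  '>gi|536911|CP006573.1|' 'ATGAAA' 'GATTAA' \
  '>gi|536912|gb|CP006573.1|' 'ATGAAG' 'TAGCAT' 'TCACTA' \
  > "$demo_dir/seqs.fa"

# -A takes a number: print each matching line plus the 2 lines after it.
fixed_context=$(grep -A 2 -f "$demo_dir/ids.txt" "$demo_dir/seqs.fa")
printf '%s\n' "$fixed_context"
```

With these files the first record comes out whole, but the third sequence line of the second record (`TCACTA`) is cut off.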

An awk solution would be:

$ awk -F\| 'NR==FNR{pattern[$0];next} { if ($2 in pattern){flag=1} else if(NF > 1){flag=0}} flag' file1 file2
>gi|536911|CP006573.1|:c959-690 Mannheimia haemolytica D171, complete genome
ATGAAATGCGAACGTTTAGAAGAGTTATTAGAGTTACTTGGCGAACATTGGCGTAAAAATCCTGACTTAC
ACCTCATTGATATTTTGCAGCAGCTTTCAGTTGAAGTGGGCGAGCCTGATAATTTCAAAGCGTTAAGCGA
TGAAGTGTTAATCTATCAGCTTAAAATGCGAAATGCAGGCAAATTTGAGCCTATTCCCGGCATAAAAAAA
GATTATGAAGATGATTTTAAAACGGCTTTATTGCGAGCTCGTGGAATTTTAAACGATTAA
>gi|536912|gb|CP006573.1|:c6390-2194 Mannheimia haemolytica D171, complete genome
ATGAAGACCAAAACATTTACTCGTTCTTATCTTGCTTCTTTTGTAACAATCGTATTAAGTTTACCTGCTG
TAGCATCTGTTGTACGTAATGATGTGGACTATCAATACTTCCGCGATTTTGCCGAAAATAAAGGACCATT
TTCAGTTGGTTCAATGAATATTGATATTAAAGACAACAATGGACAACTTGTAGGCACGATGCTTCATAAT
TTACCAATGGTTGATTTTAGTGCTATGGTAAGAGGTGGATATTCTACTTTAATTGCACCACAATATTTAG
TTAGTGTTGCACATAATACTGGATATAAAAATGTTCAATTTGGTGCTGCAGGTTATAACCCTGATTCACA
TCACTATACTTATAAAATTGTTGACCGCAATGATTATGAAAAGGTTCAAGGAGGGTTGCACCCAGACTAT

What does it do?

  • `-F\|` sets the field separator to `|`.

  • `NR==FNR{pattern[$0];next}` stores each line of the first file as a key in the array `pattern`. `NR==FNR` is true only while reading the first file, file1.

  • `{ if ($2 in pattern){flag=1}` sets `flag` to 1 if the second field, `$2`, is a key in `pattern`.

  • `else if(NF > 1){flag=0}}` sets `flag` to 0 only when the ID is not in `pattern` and the line is a header of the form `>gi|xxxxx|` (more than one field).

  • `flag`: if the flag is set, awk performs the default action and prints the entire line.
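The same idea can be laid out over several lines with comments. This is only a sketch on made-up data: the `work_dir` directory and the shortened sequences are illustrative, and it detects header lines with `/^>/` instead of the `NF > 1` test used above, which assumes every record starts with `>`.

```shell
# Hypothetical demo data; shortened sequences are illustrative only.
work_dir=$(mktemp -d)
printf '%s\n' 536911 536912 536920 > "$work_dir/file1"
printf '%s\n' \
  '>gi|536911|CP006573.1|:c959-690' 'ATGAAA' 'GATTAA' \
  '>gi|536912|gb|CP006573.1|:c6390-2194' 'ATGAAG' \
  '>gi|536913|gb|CP006573.1|:7500-8540' 'ATGTTT' \
  > "$work_dir/file2"

selected=$(awk -F'|' '
  NR == FNR { wanted[$0]; next }   # first file: remember each ID
  /^>/ { keep = ($2 in wanted) }   # header line: decide for the whole record
  keep                             # print the line while its record is selected
' "$work_dir/file1" "$work_dir/file2")
printf '%s\n' "$selected"
```

On this data the records for 536911 and 536912 are printed with all their sequence lines, and the 536913 record is skipped.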

nu11p01n73R
0

You could remove the line endings and then turn each record into a single line:

cat file2.txt| tr -d '\n' | sed -e $'s/>gi/\\\n>gi/g'| grep -f file1.txt

Or paying heed to "useless use of cat" ;-)

tr -d '\n' < file2.txt | sed -e $'s/>gi/\\\n>gi/g' | grep -f file1.txt

à la chomp, split in Perl.
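A small sketch of what this pipeline produces, on made-up data (the `tmp_dir` directory, the header labels and the shortened sequences are illustrative). The sed script here uses a literal line break rather than `$'...'` quoting, on the assumption that this is easier to run in a plain `sh` as well as in bash:

```shell
# Hypothetical demo data; names and shortened sequences are illustrative only.
tmp_dir=$(mktemp -d)
printf '%s\n' 536911 536912 > "$tmp_dir/file1.txt"
printf '%s\n' \
  '>gi|536911|CP006573.1| first' 'ATGAAA' 'GATTAA' \
  '>gi|536912|gb|CP006573.1| second' 'ATGAAG' \
  '>gi|536913|gb|CP006573.1| third' 'ATGTTT' \
  > "$tmp_dir/file2.txt"

# Join everything into one line, start a new line before each ">gi",
# then keep the lines that contain one of the IDs.
flattened=$(tr -d '\n' < "$tmp_dir/file2.txt" | sed 's/>gi/\
>gi/g' | grep -f "$tmp_dir/file1.txt")
printf '%s\n' "$flattened"
```

Note that each selected record comes out as one line, with the header and the sequence run together, so the layout differs from the required output shown in the question even though the sequence data is the same.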

Community
G. Cito
  • This script works well, but in the output file `>gi` is replaced by `>g`. I am not worried about this, since the sequence itself is unchanged. Thank you –  Nov 24 '14 at 09:05
  • Sorry, I mixed up my cut and paste with my shell history :-( Use `tr` **but only for line ends**, then use `sed`. The point was to show that it's easier to do a simple replacement step with `tr` before using `sed` (and that `sed` needs double escaping to work from the shell) – G. Cito Nov 24 '14 at 15:00
  • hope this didn't cause any inconvenience – G. Cito Nov 24 '14 at 15:04
  • I have finally finished the analysis and got the required output. Thank you very much –  Nov 25 '14 at 13:24