print line with match and next line, but only first match, from a file of strings

Question

I have two files, one with a newline-separated list of numeric IDs:

>cat list.txt
3342
232
...

and one with those IDs, each followed by some sequence data on the next line:

>cat Sequence.txt

>600
ATCGCGG
>3342
ACTCGGTC
>232
TGTGCT
>3342
ACGCGGTC

I would like to print every line whose ID matches, together with the next line, but only the first time each ID is found. So the output would be:

> ...some code... list.txt Sequence.txt
>3342
ACTCGGTC
>232
TGTGCT

Note that only the first occurrence of ID 3342, and the line after it, is printed.

I tried using grep,

grep -f list.txt -A 1 -m 1 Sequence.txt

But it wasn't working. Running grep with -A 1 and -m 1 on a single actual ID produced what I want, but I have thousands of IDs and can't run each by hand.
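
A possible explanation, sketched on the sample data: with `-f`, the `-m 1` limit applies to the file as a whole, so grep stops at the first line that matches any ID rather than at the first match of each ID. The scratch directory and the `out` variable below are only for illustration.

```shell
# Recreate the sample data from the question in a scratch directory
# (the directory and variable names are illustrative assumptions).
workdir=$(mktemp -d)
printf '%s\n' 3342 232 > "$workdir/list.txt"
printf '%s\n' '>600' ATCGCGG '>3342' ACTCGGTC '>232' TGTGCT '>3342' ACGCGGTC \
    > "$workdir/Sequence.txt"

# -m 1 counts matching lines across ALL patterns read with -f, so grep stops
# after the first line that matches any ID; the record for 232 is never reached.
out=$(grep -f "$workdir/list.txt" -A 1 -m 1 "$workdir/Sequence.txt")
printf '%s\n' "$out"
```

Note also that the patterns are unanchored here, so an ID such as 232 would match a header like `>2232` as well.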

This requires rescanning Sequence.txt for each ID in list.txt. If you have thousands and thousands of IDs, you will thrash your hard drive and wait hours for it to finish. You could do it in one pass if you know the IDs (list.txt) ahead of time, but you'd need a script to do it. You can create a regex trie using [this](http://www.regexformat.com/Dnl/_Samples/_Ternary_Tool%20(Dictionary)/___txt/Q-words.txt) tool, then match the data file with it. The result is instantaneous. – , Aug 27 '15 at 15:16
An excellent point. If you truly have thousands of IDs to look up, you should use a tool suited to lookups: you could write a very simple program that reads `Sequence.txt` into a map/hash/associative array (whatever the language calls it), and then perform the lookups quickly and easily. – dcsohl, Aug 27 '15 at 18:35
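
One way to sketch the lookup-table idea from these comments, using awk's associative arrays as the map. This is an illustration under assumptions, not either commenter's code: it stores the first sequence seen for each ID in `Sequence.txt`, then walks `list.txt` and prints the hits in list order. The function name `first_records` and the scratch directory are made up for the example.

```shell
# first_records SEQFILE LISTFILE  (hypothetical helper name)
# Pass 1 (SEQFILE): remember the line that follows each ">ID" header, keeping
# only the first occurrence of every ID.
# Pass 2 (LISTFILE): print the header and the stored line for each listed ID.
first_records() {
    awk '
        NR == FNR {
            if (pending != "") {               # previous line was a new header
                seq[pending] = $0
                pending = ""
            } else if (/^>/ && !($0 in seq)) {
                pending = $0                   # first time this header is seen
            }
            next
        }
        { key = ">" $0 }
        key in seq { print key; print seq[key] }
    ' "$1" "$2"
}

# Demo on the sample data from the question.
workdir=$(mktemp -d)
printf '%s\n' 3342 232 > "$workdir/list.txt"
printf '%s\n' '>600' ATCGCGG '>3342' ACTCGGTC '>232' TGTGCT '>3342' ACGCGGTC \
    > "$workdir/Sequence.txt"
out=$(first_records "$workdir/Sequence.txt" "$workdir/list.txt")
printf '%s\n' "$out"
```

The trade-off in this sketch is memory: every first sequence is held in the array, whereas loading only `list.txt` into the array keeps memory proportional to the number of IDs.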

score 2 · Answer 1 · answered Aug 27 '15 at 15:08

awk 'NR==FNR{tgts[">"$0]; next} $0 in tgts{c=2; delete tgts[$0]} c&&c--' list.txt Sequence.txt
>3342
ACTCGGTC
>232
TGTGCT

Ed Morton

`c&&c--` reads: if c is non-zero, evaluate `c--`, which yields the value c had before the decrement (so still non-zero) and then decrements c; the condition is therefore true and the default action of printing the current record is invoked. You might think you could do something like `c-->0` instead, but I'm not convinced that on a huge file the `c--` won't exceed the size of a variable and wrap around to become positive again (like -MAXINT - 1 = MAXINT). You can see more uses of it at http://stackoverflow.com/a/18409469/1745001 – Ed Morton Aug 27 '15 at 15:16
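
To make the counter idiom concrete, here is the same one-liner spread over several lines with comments. It is a restatement for illustration, not a different method; the wrapper name `first_match_with_next` and the scratch directory are invented for the example, and the sample data is the question's.

```shell
# first_match_with_next LISTFILE SEQFILE  (hypothetical wrapper name)
first_match_with_next() {
    awk '
        # First file (the ID list): record each ID as a wanted header, e.g. ">3342".
        NR == FNR { tgts[">" $0]; next }

        # Second file: on the first hit for a header, arm the counter for two
        # records (the header and the line after it), then forget the header so
        # that later duplicates are ignored.
        $0 in tgts { c = 2; delete tgts[$0] }

        # c-- yields the value of c before the decrement, so this pattern is
        # true while c is 2 and then 1, printing exactly two records per hit.
        c && c--
    ' "$1" "$2"
}

# Demo on the sample data from the question.
workdir=$(mktemp -d)
printf '%s\n' 3342 232 > "$workdir/list.txt"
printf '%s\n' '>600' ATCGCGG '>3342' ACTCGGTC '>232' TGTGCT '>3342' ACGCGGTC \
    > "$workdir/Sequence.txt"
out=$(first_match_with_next "$workdir/list.txt" "$workdir/Sequence.txt")
printf '%s\n' "$out"
```

Setting `c` to a larger value would print that many records starting at the matching line, which is why the idiom generalizes beyond "match plus one line".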

score 1 · Answer 2 · answered Aug 27 '15 at 15:08

You can use this awk command:

awk -F'>' 'NR==FNR{a[$1];next} $2 in a{p=1; print; delete a[$2]; next}; 
      p; {p=0}' list.txt Sequence.txt
>3342
ACTCGGTC
>232
TGTGCT

anubhava

dcsohl · Answer 3 · 2015-08-27T17:20:20.240

0

You are so close. Give this a try:

while IFS= read -r id; do grep -A 1 -m 1 -x ">$id" Sequence.txt; done < list.txt

edited Aug 27 '15 at 17:20

answered Aug 27 '15 at 15:03

dcsohl

@EdMorton - D'oh. You are correct. I have remedied this. – dcsohl Aug 27 '15 at 17:20
