
I have a very large text file, myReads.sam, that looks like this:

J00118:315:HMJWTBBXX:4:1118:21684:2246  4   *   0   0   *   *   0   0   CR:Z:TTTGTCATCTGTTTGT   
J00118:315:HMJWTBBXX:4:2211:19532:14449 4   *   0   0   *   *   0   0   CR:Z:TATGTCATCTTTCCTC

I have another 500 line text file, myIDs.txt, that looks like this:

CR:Z:TTTGTCATCTGTTTGT
CB:Z:CTACCCAGTCGACTGC
QT:Z:AAFFFJJJ

I want to create a third text document, myFilteredReads.sam, that excludes any line that does not contain one of the character strings in myIDs.txt. So, for example, if I applied this filter using the snippet of myReads.sam and myIDs.txt above, the new file would look like:

J00118:315:HMJWTBBXX:4:1118:21684:2246  4   *   0   0   *   *   0   0   CR:Z:TTTGTCATCTGTTTGT   

I know if I was only filtering on a single string (e.g. 'CR:Z:TTTGTCATCTGTTTGT'), I could use awk like this:

cat myReads.sam | awk '!/CR:Z:TTTGTCATCTGTTTGT/' > myPartiallyFilteredReads.sam

I'm not sure how to command awk to replace the part in quotes with each line of the file, though. I thought I might try looping through the files:

cat myIDs.txt | awk 'BEGIN {i = 1; do { !/i/; ++i } while (i < 500) }' myReads.sam > myFilteredReads.sam

...but that hasn't worked for me.

Any suggestions? Thanks in advance.

K M
  • Deleted my answer now, moving to a comment - `awk 'FNR==NR{idString[$NF]; next}$NF in idString' myIDs.txt myReads.sam `. It is one of many duplicates floating around in the `awk` tag – Inian Jun 06 '18 at 20:02
  • The double negative in your question is confusing people. "exclude lines that don't include" == "keep lines that include". And maybe if you'd thought of it that way you would have realized how easy it is. – Barmar Jun 06 '18 at 20:10
  • Yes, I meant "keep lines that include" - thank you. – K M Jun 06 '18 at 20:36
  • @Inian suggest duplicate posts please? Looks like duplicate to me, too. Maybe [this](https://stackoverflow.com/questions/22837707/filtering-file-entries-based-on-another-file-as-match-condition) or [this](https://stackoverflow.com/questions/18044007/filter-a-file-with-other-file-in-bash) – zx8754 Jun 06 '18 at 20:48
  • @zx8754 : The problem with those is that they match on a different column than the one holding the duplicate entries. In this case it's the last column, which wouldn't be the case in the others, so people would easily shoo them off as non-duplicates – Inian Jun 06 '18 at 20:55
  • Oops, yes, the double negative threw me off track. The `-v` should be removed from my suggestions: `grep -Ff myIDs.txt myReads.sam` and `grep -wFf myIDs.txt myReads.sam`. – Benjamin W. Jun 06 '18 at 21:11
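
For reference, the awk one-liner in Inian's comment above reads myIDs.txt first, remembers each ID, and then keeps only the lines of myReads.sam whose last field is one of the remembered IDs. A sketch of that approach (it assumes the ID is always the last whitespace-separated field of each read, as in the sample lines above):

awk 'FNR==NR { idString[$NF]; next }   # first file (myIDs.txt): remember each ID as an array key
     $NF in idString                   # second file (myReads.sam): print lines whose last field is a remembered ID
    ' myIDs.txt myReads.sam > myFilteredReads.sam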

2 Answers


There is a very simple way to accomplish what you are attempting. grep can read its patterns from a file with the -f option, and the -v option inverts the match. So you can find all lines in myReads.sam that do not contain any of the patterns in myIDs.txt with

grep -v -f myIDs.txt myReads.sam

Example Use/Output

Using your data in data.txt and your IDs in filter.txt, you get your desired results, e.g.

$ grep -v -f filter.txt data.txt
J00118:315:HMJWTBBXX:4:2211:19532:14449 4   *   0   0   *   *   0   0   CR:Z:TATGTCATCTTTCCTC

Edit -- If you Want Only Lines that ARE in myIDs.txt

Then remove the -v, e.g.

$ grep -f filter.txt data.txt
J00118:315:HMJWTBBXX:4:1118:21684:2246  4   *   0   0   *   *   0   0   CR:Z:TTTGTCATCTGTTTGT

Sorry I misunderstood what you intended to include/exclude.
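
As the comments below note, adding -F treats each line of myIDs.txt as a fixed string rather than a regular expression (faster, and safer should an ID ever contain regex metacharacters), and -w restricts matches to whole words. A sketch combining both, redirected to the output file named in the question:

grep -wFf myIDs.txt myReads.sam > myFilteredReads.sam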

David C. Rankin
  • I was just about to suggest that. Note that adding -F makes the search faster and should work here! – ewramner Jun 06 '18 at 19:42
  • It was just the first thing that came to mind. Looping line-by-line works as well, but letting `grep` handle the read instead of the shell prevents calling `grep` multiple times within the body of the loop. If the data file is small -- it makes little difference -- if the data file is huge -- it makes a huge difference `:)` Good call on the use of `-F` for fixed-string treatment. – David C. Rankin Jun 06 '18 at 19:44
  • I think you got confused by the double negatives in the question. Your result is the opposite of his desired result. Get rid of `-v`. – Barmar Jun 06 '18 at 20:07
  • In that case -- yes I did - thanks. – David C. Rankin Jun 06 '18 at 20:08
  • Ah, this worked like a charm! Thank you! – K M Jun 06 '18 at 21:34

main is the file with the content

str is the file with the 'interesting strings'

out is the output file

#!/bin/bash

# read each ID from str and append the lines of main that contain it to out
while IFS= read -r line; do
  grep "$line" main >> out
done < str
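
Adapted to the filenames in the question (a sketch, not part of the original answer), the same loop looks like this. Per the comments on the other answer, -F matches each ID as a literal string; also note that this invokes grep once per ID, so a single grep -Ff myIDs.txt myReads.sam call will be considerably faster for large inputs:

# assumed mapping: main = myReads.sam, str = myIDs.txt, out = myFilteredReads.sam
while IFS= read -r id; do
  grep -F -- "$id" myReads.sam      # -F: treat the ID as a fixed string, not a regex
done < myIDs.txt > myFilteredReads.sam
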
chenchuk