I have a very large text file, myReads.sam, that looks like this:
J00118:315:HMJWTBBXX:4:1118:21684:2246 4 * 0 0 * * 0 0 CR:Z:TTTGTCATCTGTTTGT
J00118:315:HMJWTBBXX:4:2211:19532:14449 4 * 0 0 * * 0 0 CR:Z:TATGTCATCTTTCCTC
I have another, 500-line text file, myIDs.txt, that looks like this:
CR:Z:TTTGTCATCTGTTTGT
CB:Z:CTACCCAGTCGACTGC
QT:Z:AAFFFJJJ
I want to create a third text file, myFilteredReads.sam, that excludes any line that does not contain one of the character strings in myIDs.txt. So, for example, if I applied this filter to the snippets of myReads.sam and myIDs.txt above, the new file would look like this:
J00118:315:HMJWTBBXX:4:1118:21684:2246 4 * 0 0 * * 0 0 CR:Z:TTTGTCATCTGTTTGT
I know that if I were only filtering on a single string (e.g. 'CR:Z:TTTGTCATCTGTTTGT'), I could use awk like this to keep only the lines containing it:
awk '/CR:Z:TTTGTCATCTGTTTGT/' myReads.sam > myPartiallyFilteredReads.sam
I'm not sure how to tell awk to replace the quoted pattern with each line of myIDs.txt, though. I thought I might try looping through the files:
cat myIDs.txt | awk 'BEGIN {i = 1; do { !/i/; ++i } while (i < 500) }' myReads.sam > myFilteredReads.sam
...but that hasn't worked for me.
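In case it helps to show the shape of what I'm after, here is a rough sketch of the kind of two-file awk command I imagine might do it. This is just a guess on my part: I'm assuming the FNR==NR idiom is the right way to read the IDs in first, and that index() is the right way to test whether a read line contains one of them.

awk '
  FNR==NR { ids[$0]; next }                               # while reading the first file (myIDs.txt), remember each ID string
  { for (id in ids) if (index($0, id)) { print; next } }  # for each line of myReads.sam, print it if it contains any stored ID
' myIDs.txt myReads.sam > myFilteredReads.sam

Or perhaps something like grep -F -f myIDs.txt myReads.sam is closer to what I need? I'm not sure which direction to take.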
Any suggestions? Thanks in advance.