
I have a tab-delimited file where column 1 is an ID and column 2 is information. I have a second file containing a list of IDs that need to be removed from the first file. When I use grep, I either get an unchanged copy of the first file, or a blank file when I combine -v with the -F -f "file2.txt" flags/arguments. My question is: how do I compare the IDs in file2.txt against file1 and eliminate those rows from file1, writing the output to file3?

"}">
awk 'BEGIN{RS=">"}NR>1{sub("\n","\t"); gsub("\n",""); print RS$0}' "$1" > fasta.tab
grep -v -F -f "$2" fasta.tab >rmOutput.tab
tr '\t' '\n' <rmOutput.tab >rmOutput.fas
echo Runtime Complete

Line 1: Create a tab-delimited file from input 1
Line 2: Check input 2 for matches and remove those rows from the tab-delimited file
Line 3: Recreate the format of input 1 (for clarity)

EDIT: Sample I/O

Input 1 (tab-delimited, after line 1):

ID1    Info1
ID2    Info2
ID3    Info3
ID4    Info4
ID5    Info5

Input 2 (IDs to be deleted):

ID2
ID4
ID5

Desired Output (from line 2):

ID1    Info1
ID3    Info3
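For reference, one way to get exactly this output with grep is to anchor each ID to the start of the line, followed by a tab, so an ID like ID2 cannot accidentally match a longer ID such as ID22. This is only a sketch: the file names are placeholders for the sample data above, and it assumes GNU sed (so that \t in the replacement is understood) and that the IDs contain no regex metacharacters.

```shell
# Recreate the sample files from the question (placeholder names).
printf 'ID1\tInfo1\nID2\tInfo2\nID3\tInfo3\nID4\tInfo4\nID5\tInfo5\n' > input1.txt
printf 'ID2\nID4\nID5\n' > input2.txt

# Turn each ID into an anchored pattern: ^ID<TAB>
sed 's/^/^/; s/$/\t/' input2.txt > patterns.txt

# -v inverts the match, -f reads the patterns from a file.
grep -v -f patterns.txt input1.txt > file3.txt
cat file3.txt
```

This prints the ID1 and ID3 rows only.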
  • What is the point of the awk+grep+tr+echo shell script at the top of your question? Also you mention `after line 2` and similar but it's not at all clear how that relates to your sample input/output - clarify that. – Ed Morton Aug 01 '16 at 19:11
  • The input file is a sequence file. That is it's of format >SeqID Sequence etc the point is to turn the file into one large tab-delim file. It was a script given to me, so I'm not sure if it's the most efficient/practical. – Michael Bale Aug 01 '16 at 19:14
  • Are you saying the sample input you posted isn't actually in the input format you have to handle? – Ed Morton Aug 01 '16 at 19:15
  • The first line of the code outputs the sample output listed. – Michael Bale Aug 01 '16 at 19:16
  • But it does nothing even vaguely related to your question or the sample input files you posted. Just post the real input files and the actual output file you want given those input files. That initial shell script is adding no value. – Ed Morton Aug 01 '16 at 19:18

2 Answers

awk 'NR==FNR{a[$0];next} !($1 in a)' input2 input1
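A quick sanity check with the question's sample data (input1/input2 here are just the sample files written out): NR==FNR is true only while awk is reading the first file given (input2), so each ID is stored as an array key and the record is skipped; for lines of input1, only rows whose first field is not a stored key are printed.

```shell
# Recreate the question's sample files.
printf 'ID1\tInfo1\nID2\tInfo2\nID3\tInfo3\nID4\tInfo4\nID5\tInfo5\n' > input1
printf 'ID2\nID4\nID5\n' > input2

# First pass (input2): collect IDs as array keys.
# Second pass (input1): print lines whose column 1 is not a collected ID.
awk 'NR==FNR{a[$0];next} !($1 in a)' input2 input1
# prints:
# ID1    Info1
# ID3    Info3
```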
Ed Morton

If there are not too many different IDs to delete, run a simple loop, removing lines in place with sed:

# bash
cp file1.txt out_file.txt
for rem in $(cat file2.txt)
do
  echo "$rem"
  sed -i "/$rem/d" out_file.txt
done

# fish
cp file1.txt out_file.txt
for rem in (cat file2.txt)
  echo $rem
  sed -i "/$rem/d" out_file.txt
end
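One caveat worth noting: `sed "/$rem/d"` deletes any line that merely contains the ID anywhere, so `ID2` would also remove a line for `ID22`. Anchoring the pattern to the first column avoids that. This is a sketch, not a drop-in replacement: it assumes GNU sed (for -i and \t) and tab-delimited input, and the sample file names are placeholders.

```shell
# Sample data illustrating the substring pitfall (placeholder file names).
printf 'ID1\tInfo1\nID2\tInfo2\nID22\tInfo22\n' > file1.txt
printf 'ID2\n' > file2.txt

cp file1.txt out_file.txt
while IFS= read -r rem
do
  # Anchor: the line must start with the ID followed by a tab,
  # so deleting ID2 leaves the ID22 row intact.
  sed -i "/^${rem}\t/d" out_file.txt
done < file2.txt
cat out_file.txt   # ID1 and ID22 rows remain
```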

PS

anticipating some flame from people with cryptic bash process substitutions and awkward awk scripts, let me say: indeed, you should not use this very simple and pleasant-to-read approach if you have many different IDs to remove. However, it follows The Holy Unix Philosophy Principles:

  1. Fancy algorithms are buggier than simple ones, and they're much harder to implement. Use simple algorithms as well as simple data structures. (c) Rob Pike

And a more important one:

Rule of Clarity: Clarity is better than cleverness.

Because maintenance is so important and so expensive, write programs as if the most important communication they do is not to the computer that executes them but to the human beings who will read and maintain the source code in the future (including yourself).

I've also added a snippet with fish code above.

xealits
  • Read [why-is-using-a-shell-loop-to-process-text-considered-bad-practice](http://unix.stackexchange.com/questions/169716/why-is-using-a-shell-loop-to-process-text-considered-bad-practice) to understand some, but not all, of the reasons why you should never do that. – Ed Morton Aug 01 '16 at 19:16
  • Input file 1 can be anywhere from 20 to 1200 lines, with input 2 being anything from 1 line up to one less than input 1. – Michael Bale Aug 01 '16 at 19:17
  • @MichaelBale well, then it is too much for a loop. – xealits Aug 01 '16 at 19:19