I have a space-separated file that looks like this:

$ cat in_file
GCF_000046845.1_ASM4684v1_protein.faa WP_004920342.1 Chal_sti_synt_C
GCF_000046845.1_ASM4684v1_protein.faa WP_004927566.1 Chal_sti_synt_C
GCF_000046845.1_ASM4684v1_protein.faa WP_004919950.1 FAD_binding_3
GCF_000046845.1_ASM4684v1_protein.faa WP_004920342.1 FAD_binding_3

I am using the following shell script utilizing grep to search for strings:

$ cat search_script.sh
grep "GCF_000046845.1_ASM4684v1_protein.faa WP_004920342.1" Pfam_anntn_temp.txt
grep "GCF_000046845.1_ASM4684v1_protein.faa WP_004920342.1" Pfam_anntn_temp.txt

The problem is that I want each grep command to return only the first instance of the string it finds exclusive of the previous identical grep command's output.

I need an output which would look like this:

$ cat out_file
GCF_000046845.1_ASM4684v1_protein.faa WP_004920342.1 Chal_sti_synt_C
GCF_000046845.1_ASM4684v1_protein.faa WP_004920342.1 FAD_binding_3

in which line 1 is exclusively the output of the first grep command and line 2 is exclusively the output of the second grep command. How do I do it?

P.S. I am running this on a big file (>125,000 lines). So, search_script.sh is mostly composed of unique grep commands. It is the identical commands' execution that is messing up my downstream analysis.

BhushanDhamale
  • @WiktorStribiżew I had already gone through that question, and its answers. My question is different, and the answers to that question do not fit my purpose. – BhushanDhamale May 23 '19 at 10:18
  • I revoked the close vote. – Wiktor Stribiżew May 23 '19 at 10:19
  • I think you do not need to repeat the command. Use it once ([it seems to extract what you need](https://ideone.com/ntw0CC)), get all matches, then iterate over them. – Wiktor Stribiżew May 23 '19 at 10:20
  • As @WiktorStribiżew suggests, run your `grep` just once, put the results in an array, then loop through the array and use them in turn. See point 2 here and put your `grep` in place of `my_command`: https://stackoverflow.com/a/32931403/2836621 – Mark Setchell May 23 '19 at 10:41

2 Answers

I'm assuming you generate search_script.sh automatically from the contents of in_file. If you can count how many times the same grep command will be repeated, you can run grep once and pipe its output through head. For example, if you know you'll be using it 2 times:

grep "foo" bar.txt | head -2

This outputs the first 2 lines of bar.txt that contain "foo".

If you have to do the grep commands separately, for example if you have other code in between the grep commands, you can mix head and tail:

grep "foo" bar.txt | head -1 | tail -1

Some other commands...

grep "foo" bar.txt | head -2 | tail -1
  • head -N displays the first N lines of the input
  • tail -N displays the last N lines of the input, so head -2 | tail -1 leaves only the second match
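
The head/tail combination above can be wrapped in a small helper so that each call asks for a specific occurrence. This is only a sketch: nth_match is a hypothetical function name, sed -n is swapped in for head | tail, and grep -F is used on the assumption that the search string should be matched literally (otherwise the dots in the names act as regex wildcards). The sample data is copied from the question.

```shell
#!/bin/sh
# Sketch: print only the n-th line of a file that contains a fixed string.
# nth_match is a hypothetical helper name, not an existing command.
nth_match() {
    # $1 = search string, $2 = file, $3 = which occurrence (1-based)
    # grep -F matches the string literally; sed -n "Np" prints only line N
    # of grep's output.
    grep -F -- "$1" "$2" | sed -n "${3}p"
}

# Sample data copied from the question, written to a temporary file.
data=$(mktemp)
trap 'rm -f "$data"' EXIT
cat > "$data" <<'EOF'
GCF_000046845.1_ASM4684v1_protein.faa WP_004920342.1 Chal_sti_synt_C
GCF_000046845.1_ASM4684v1_protein.faa WP_004927566.1 Chal_sti_synt_C
GCF_000046845.1_ASM4684v1_protein.faa WP_004919950.1 FAD_binding_3
GCF_000046845.1_ASM4684v1_protein.faa WP_004920342.1 FAD_binding_3
EOF

key="GCF_000046845.1_ASM4684v1_protein.faa WP_004920342.1"
nth_match "$key" "$data" 1   # first matching line (Chal_sti_synt_C)
nth_match "$key" "$data" 2   # second matching line (FAD_binding_3)
```

When generating search_script.sh, the occurrence number would have to be filled in per repeated key, which is the same counting step the head -N approach needs.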

If you really MUST always use the same command, but ensure that the outputs always differ, the only way I can think of to achieve this is using temporary files and a complex sequence of commands:

 cat foo.bar.txt.tmp 2>&1 | xargs -I xx echo "| grep -v \\'xx\\' " | tr '\n' ' '  | xargs -I xx sh -c "grep 'foo' bar.txt xx | head -1 | tee -a foo.bar.txt.tmp"

To explain this command: with foo as the search string and bar.txt as the file name, foo.bar.txt.tmp serves as a unique name for a temporary file. The temporary file holds the lines that have already been output:

  • cat foo.bar.txt.tmp 2>&1 : outputs the contents of the temporary file. If the file does not exist yet, the error message goes to stdout instead (this matters because with empty input the rest of the command wouldn't run).
  • xargs -I xx echo "| grep -v \\'xx\\' " wraps each line of the temporary file in | grep -v '...'. grep -v something excludes lines that contain something.
  • tr '\n' ' ' replaces newlines with spaces, so that the sequence of grep -v filters ends up on a single line.
  • xargs -I xx sh -c "grep 'foo' bar.txt xx | head -1 | tee -a foo.bar.txt.tmp" runs a new command, grep 'foo' bar.txt xx | head -1 | tee -a foo.bar.txt.tmp, replacing xx with the previous output. At this point xx is the sequence of grep -v filters that exclude the previous outputs.
  • head -1 makes sure only one line is output at a time.
  • tee -a foo.bar.txt.tmp appends the new output to the temporary file.

Just be sure to remove the temporary files (rm *.tmp) at the end of your script.
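
If the repeated searches can be written as a list of keys instead of separate grep commands, the same "each repeat returns the next unseen match" behaviour can be sketched in a single pass with awk, without temporary state files. This is a different technique from the xargs pipeline above, and it rests on assumptions: the queries file and the next_matches name are hypothetical, and a key is taken to be exactly the first two space-separated fields of a line (an exact match, not a substring search).

```shell
#!/bin/sh
# Sketch: the k-th request for a key prints the k-th data line with that key.
# next_matches and the queries file are hypothetical names for illustration.
next_matches() {
    # $1 = data file, $2 = query file (one "field1 field2" key per line)
    awk '
        # First file (data): store each line under its key and occurrence number.
        NR == FNR { key = $1 " " $2; line[key, ++seen[key]] = $0; next }
        # Second file (queries): count requests per key and print the matching
        # occurrence, if the data has that many lines for the key.
        { key = $1 " " $2; k = ++asked[key]; if ((key, k) in line) print line[key, k] }
    ' "$1" "$2"
}

data=$(mktemp)
queries=$(mktemp)
trap 'rm -f "$data" "$queries"' EXIT

# Sample data copied from the question.
cat > "$data" <<'EOF'
GCF_000046845.1_ASM4684v1_protein.faa WP_004920342.1 Chal_sti_synt_C
GCF_000046845.1_ASM4684v1_protein.faa WP_004927566.1 Chal_sti_synt_C
GCF_000046845.1_ASM4684v1_protein.faa WP_004919950.1 FAD_binding_3
GCF_000046845.1_ASM4684v1_protein.faa WP_004920342.1 FAD_binding_3
EOF

# A repeated key means "give me the next match for this key".
cat > "$queries" <<'EOF'
GCF_000046845.1_ASM4684v1_protein.faa WP_004920342.1
GCF_000046845.1_ASM4684v1_protein.faa WP_004927566.1
GCF_000046845.1_ASM4684v1_protein.faa WP_004920342.1
EOF

next_matches "$data" "$queries"
```

Because the data file is read once and held in memory, this avoids rescanning a large file for every query; whether that trade-off suits a given input size is something to check.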

3snoW

If I understand the question correctly and you want to remove duplicates based on the last field of each line, then try the following (this should be an easy task for awk).

awk '!a[$NF]++'  Input_file
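
To see what this one-liner keeps, here is a small run on the sample data copied from the question. The wrapper name dedup_last_field and the array name seen are only illustrative; the awk program is otherwise the same as above.

```shell
#!/bin/sh
# Sketch: keep the first line seen for each distinct last field ($NF).
# dedup_last_field is a hypothetical wrapper name.
dedup_last_field() {
    # seen[$NF]++ is 0 (false) the first time a last field appears,
    # so the negation is true and awk prints that line only.
    awk '!seen[$NF]++' "$1"
}

# Sample data copied from the question.
data=$(mktemp)
trap 'rm -f "$data"' EXIT
cat > "$data" <<'EOF'
GCF_000046845.1_ASM4684v1_protein.faa WP_004920342.1 Chal_sti_synt_C
GCF_000046845.1_ASM4684v1_protein.faa WP_004927566.1 Chal_sti_synt_C
GCF_000046845.1_ASM4684v1_protein.faa WP_004919950.1 FAD_binding_3
GCF_000046845.1_ASM4684v1_protein.faa WP_004920342.1 FAD_binding_3
EOF

dedup_last_field "$data"
```

On this sample the kept lines are the first and the third, so the FAD_binding_3 line that survives belongs to WP_004919950.1. It is worth checking whether that matches the output you need.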
RavinderSingh13