
I have a file organisms.txt with one organism (genus and species) per line.

Escherichia coli
Staphylococcus aureus
Prevotella sp. 855
Saprospirales
Candidatus Accumulibacter phosphatis

I want to use grep to search through another file for each organism and write the matches to an output file named after the organism. My file large_file.txt looks like this:

Parcubacteria bacterium    0    87    2762014
Saprospirales    837    78    1936988
Escherichia coli    857    95    562
Bacteroides ihuae    12    100    1852362
Candidatus Escherichia coli O12H3    988    95    888
Dialister invisus    30    86    218538
Fake Escherichia bacterium    112    99    110
Escherichia coli 07798    1094    99   1005566
Escherichia coli    14    87    562
Saprospirales bacterium    87    98.6    4587674
Saprospirales sp.    12588    99    1936988

I am using this while loop.

while IFS= read -r line
do
out="${line}_hits.txt"
grep "${line}" large_file.txt
> "$out"
done < "organisms.txt"

I have manually verified that organisms from my list do occur in large_file.txt. The loop creates all the output files, but they are all empty. I would expect, for example, that the output file Escherichia coli_hits.txt would look like this:

Escherichia coli    857    95    562
Candidatus Escherichia coli O12H3    988    95    888
Escherichia coli 07798    1094    99   1005566
Escherichia coli    14    87    562

And I would expect the output file Saprospirales_hits.txt to look like this:

Saprospirales    837    78    1936988
Saprospirales bacterium    87    98.6    4587674
Saprospirales sp.    12588    99    1936988

I would also expect an empty file named Staphylococcus aureus_hits.txt to be created, along with similar empty files for every other line in organisms.txt that is not found in large_file.txt.

What do I need to change to get my desired results?

aminards
  • Is that your exact code, with a linebreak before `> "$out"`? If so, grep writes to standard output, and the `> "$out"` line truncates the file, as nothing is written to it. – Benjamin W. Feb 15 '22 at 14:22
  • Yes, that is my exact code. I moved > "$out" to the end of line 4 but that didn't fix the issue. I am still getting empty output files. – aminards Feb 15 '22 at 14:33
  • Good catch on the typo. I fixed it for my question. – aminards Feb 15 '22 at 17:16
  • Is the white space in `large_file.txt` all blank chars or are some of them tabs? e.g. in the first line are you showing us `Parcubacteria<blank>bacterium<blank>0<blank>87<blank>2762014` or `Parcubacteria<blank>bacterium<tab>0<tab>87<tab>2762014` or something else? – Ed Morton Feb 15 '22 at 22:54

2 Answers


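Because > "$out" sits on its own line, it runs as a separate command: grep writes to standard output, and the bare redirection creates or truncates "$out" on every loop iteration without writing anything to it: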

grep "$line" large_file.txt
> "$out" # This truncates the file

This doesn't fix it:

grep "$line" large_file.txt > "$out"

because > overwrites the file each time it is opened, so $out contains only the most recent result of grep for that name. You should append instead:

grep "$line" large_file.txt >> "$out"

This still opens and closes a filehandle for each iteration, but because the output filename depends on the line being read, you can't move the redirection to outside the loop.
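Putting the pieces together, the loop might look like the sketch below. It is only an illustration: the scratch directory, the shortened sample data, and the tab-separated columns are assumptions made so the snippet runs on its own, and the `|| true` is an addition that keeps the loop going if the script runs under `set -e`.

```shell
# Work in a scratch directory so the sketch does not touch real files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Shortened sample inputs based on the question (tab-separated columns assumed).
printf '%s\n' 'Escherichia coli' 'Staphylococcus aureus' 'Saprospirales' > organisms.txt
printf '%s\t%s\t%s\t%s\n' \
    'Saprospirales' 837 78 1936988 \
    'Escherichia coli' 857 95 562 \
    'Dialister invisus' 30 86 218538 \
    'Escherichia coli 07798' 1094 99 1005566 > large_file.txt

while IFS= read -r line
do
    out="${line}_hits.txt"
    # The redirection is on the same line as grep; >> creates the file if
    # needed and appends to it. grep exits with status 1 when nothing
    # matches, so || true prevents that from aborting a set -e script.
    grep "$line" large_file.txt >> "$out" || true
done < organisms.txt
```

Because >> appends, running the loop again in the same directory adds to the existing files, so old *_hits.txt files should be removed before a rerun.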

Benjamin W.
  • I appreciate you writing this out. Making the changes however still results in empty files for me. – aminards Feb 15 '22 at 18:19
  • @aminards They'll be empty for each line where there is no match. Can you show a minimal example input for `large_file.txt` that results in an empty file? – Benjamin W. Feb 15 '22 at 18:26
  • I have added an example for large_file.txt and additional explanation to my question. – aminards Feb 15 '22 at 18:50
  • @aminards I get your expected output with my suggested solution. – Benjamin W. Feb 15 '22 at 19:18
  • Benjamin W. fix was correct. The problem I was having is addressed by Ed Morton, my organisms.txt file had Windows line endings and it needed Unix line endings. – aminards Feb 16 '22 at 14:13

Given the symptoms you describe, I'd guess your organisms.txt has DOS line endings, so the variable line in your script always ends in \r. Escherichia coli\r, for example, is never present in large_file.txt, so grep finds nothing. See why-does-my-tool-output-overwrite-itself-and-how-do-i-fix-it.
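One way to cope with this inside the loop is to strip a trailing carriage return from each line before using it. The following is a minimal sketch rather than a definitive fix: the scratch directory, the two-line sample files, and the helper variable `cr` are assumptions introduced for illustration.

```shell
# Work in a scratch directory so the sketch does not touch real files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Sample inputs: organisms.txt deliberately has DOS (CRLF) line endings.
printf '%s\r\n' 'Escherichia coli' 'Saprospirales' > organisms.txt
printf '%s\t%s\t%s\t%s\n' \
    'Saprospirales' 837 78 1936988 \
    'Escherichia coli' 857 95 562 > large_file.txt

# cr (hypothetical helper name) holds a single carriage return character.
cr=$(printf '\r')

while IFS= read -r line
do
    # Remove one trailing carriage return, if present, before using the name.
    line=${line%"$cr"}
    out="${line}_hits.txt"
    # grep exits with status 1 on no match; || true tolerates that under set -e.
    grep "$line" large_file.txt >> "$out" || true
done < organisms.txt
```

Alternatively, the file can be converted once up front, for example with `tr -d '\r' < organisms.txt > organisms_unix.txt` or with the dos2unix utility where it is installed, and the loop left unchanged.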

Ed Morton
  • Ed Morton, you are correct. The line endings were what was preventing the changes Benjamin W. suggested from working for me. – aminards Feb 16 '22 at 14:15