
I have a file ontology.txt that looks like this:

adipose.tissue  subclass_of     connective.tissue       BTO
connective.tissue       part_of whole.body      BTO
adrenal.gland   subclass_of     endocrine.gland BTO
endocrine.gland subclass_of     gland   BTO
gland   part_of whole.body      BTO
adrenal.gland   subclass_of     viscus  BTO
viscus  part_of whole.body      BTO
bone    subclass_of     connective.tissue       BTO
connective.tissue       part_of whole.body      BTO
bone    part_of skeletal.system BTO
skeletal.system part_of whole.body      BTO
bone.marrow     part_of skeletal.system BTO
brain   part_of head    old
head    part_of whole.body      BTO
brain   part_of central.nervous.system  BTO
central.nervous.system  part_of nervous.system  BTO
nervous.system  part_of whole.body      BTO
heart   part_of cardiovascular.system   old
cardiovascular.system   part_of whole.body      BTO
kidney  subclass_of     excretory.gland old
excretory.gland subclass_of     gland   old
kidney  subclass_of     viscus  old
kidney  part_of urinary.tract   old
urinary.tract   subclass_of     viscus  old
kidney  part_of urinary.system  old
urinary.system  part_of urogenital.system       old
urogenital.system       part_of whole.body      old
liver   subclass_of     viscus  old
liver   subclass_of     digestive.gland old
digestive.gland subclass_of     endocrine.gland BTO

I'm trying to write a script that will look at the fourth column, and if it has the word old in it, it won't write that line to the new file. Actually, file-writing isn't what I want to do. I want to do further operations on it. But for now, I'm doing this file-writing thing just to see if the script is fetching the right lines.

The problem is, my script ends up writing all the lines to the new file:

#!/bin/bash

while IFS= read -r line
do
        t4=`echo "$line" | cut -f4`
        #echo "$t4"
        if [[ "$t4" != "old" ]]
        then
                echo "$line" >> ont22.txt
        else
                echo "NO"
                #t1=`echo "$line" | cut -f1`
                #t2=`echo "$line" | cut -f3`
        fi
done < ontology.txt

I have literally no clue what could possibly be going wrong with this simple chunk of code. So I'd really appreciate it if someone could point it out.

Note: If you copy my ontology.txt block above, put it into another file (e.g. ont1.txt) and run this line to replace the consecutive spaces with a tab:

cat ont1.txt | sed 's/ \+/\t/g' > ontology.txt

Edit: As requested by @Cyrus, here's the output of head -n 1 ontology.txt | hexdump -C:

00000000  61 64 69 70 6f 73 65 2e  74 69 73 73 75 65 09 73  |adipose.tissue.s|
00000010  75 62 63 6c 61 73 73 5f  6f 66 09 63 6f 6e 6e 65  |ubclass_of.conne|
00000020  63 74 69 76 65 2e 74 69  73 73 75 65 09 42 54 4f  |ctive.tissue.BTO|
00000030  0d 0a                                             |..|
00000032
Cyrus
The_Questioner
  • I can't reproduce this. I copied your file, made sure it's tab-delimited, ran your code, and got the expected output. – Benjamin W. May 01 '21 at 20:18
  • Do you have carriage return characters in your input file? Try `dos2unix ontology.txt`. – Benjamin W. May 01 '21 at 20:19
  • Please add output of `head -n 1 ontology.txt | hexdump -C` to your question. – Cyrus May 01 '21 at 20:20
  • @BenjaminW. I did the `dos2unix` command, and reran the script. It's still not working. – The_Questioner May 01 '21 at 20:24
  • Is there a reason you're not using `awk -F '\t' '$4!="old"' ontology.txt`? – tripleee May 01 '21 at 20:24
  • @The_Questioner: your file contains carriage return characters (hex: 0d). – Cyrus May 01 '21 at 20:26
  • @tripleee I'm not good at reading awk, but I'm assuming that'll remove all lines with `old` in the 4th column. I don't actually want to remove the lines. I have further operations I want to do on them. I'm just doing this line-removing thing to see if the code fetches the right lines. – The_Questioner May 01 '21 at 20:26
  • Btw.: Take a look at `while read -r -a array; do echo "${array[3]}"; done < ontology.txt` when you have removed the carriage return characters – Cyrus May 01 '21 at 20:26
  • @Cyrus Oh. I'm actually not sure what to do with carriage return characters. Why are these characters preventing the script from working? Can you tell me how to fix that? Thanks. BTW, I ran the line above, and it's just giving me the 4th column. Should I have seen something different? – The_Questioner May 01 '21 at 20:29
  • @The_Questioner: See Benjamin W.'s comment. 4th column is okay in this example. It's not necessary to use `cut` to get content of 4th column from your file. – Cyrus May 01 '21 at 20:31
  • A shorter version of your script: `dos2unix ont22.txt` – Cyrus May 01 '21 at 20:35
  • @The_Questioner The root problem seems to be that the file is in DOS/Windows text format, which is just different enough from unix format to cause a lot of confusion when using it with unix tools. You can either convert the file to unix first (e.g. with `dos2unix`), or write the script to tolerate/ignore the extra carriage returns, or both. Lots more info [here](https://stackoverflow.com/questions/39527571/are-shell-scripts-sensitive-to-encoding-and-line-endings) and [here](https://stackoverflow.com/questions/2613800/how-to-convert-dos-windows-newline-crlf-to-unix-newline-lf-in-a-bash-script). – Gordon Davisson May 01 '21 at 20:43
  • Regardless of your ultimate goal, the code you exhibit looks very much like it wants to be put out of its misery and replaced with an Awk script. The shell really isn't very good at this, starting from the fact that a `while read -r` loop is orders of magnitude slower than the equivalent Awk script. (You could improve your code significantly with `while IFS=$'\t' read -r first second third fourth etc` but really, don't.) – tripleee May 03 '21 at 05:28
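
Pulling the comments together: the hexdump shows each line ending in `0d 0a`, so the fourth field is read as `old` followed by a carriage return, which never compares equal to `old`. Below is a minimal sketch of a loop that tolerates this, assuming the file is tab-delimited with exactly four columns. The temporary sample file and the variable names (`first`, `relation`, `second`, `origin`, `kept`) are hypothetical stand-ins, not part of the original script.

```shell
#!/bin/bash
# Sketch: skip rows whose fourth tab-separated column is "old",
# tolerating DOS/Windows (CRLF) line endings.

# Hypothetical sample input with CRLF line endings, standing in for ontology.txt.
sample=$(mktemp)
printf 'brain\tpart_of\thead\told\r\n' >  "$sample"
printf 'head\tpart_of\twhole.body\tBTO\r\n' >> "$sample"

kept=()
while IFS=$'\t' read -r first relation second origin
do
        # Drop a trailing carriage return from the last field, if present.
        origin=${origin%$'\r'}
        if [[ "$origin" != "old" ]]
        then
                # Further operations on the row would go here.
                kept+=("$first")
        fi
done < "$sample"

rm -f "$sample"
printf '%s\n' "${kept[@]}"
```

Splitting on tabs in `read` avoids calling `cut` once per line, and the `${origin%$'\r'}` expansion leaves files that already have unix line endings untouched. For larger files, the Awk approach suggested in the comments is likely the better fit.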
