
I would like to loop over each column in a file and check whether all values match. If they do, move on to the next column. Once a mismatch is detected, the loop should stop and print only the columns before it. I assume I'll need to use arrays in AWK, but I wasn't sure how to get started. Here is an example of the dataset I'm working with:

superkingdom:Eukaryota phylum:Arthropoda class:Insecta order:Diptera family:Culicidae genus:Anopheles species;annularis
superkingdom:Eukaryota phylum:Arthropoda class:Insecta order:Diptera family:Culicidae genus:Anopheles species;dirus
superkingdom:Eukaryota phylum:Arthropoda class:Insecta order:Diptera family:Culicidae genus:Anopheles species;dirus
superkingdom:Eukaryota phylum:Arthropoda class:Branchiopoda order:Anostraca family:Thamnocephalidae genus:Branchinella species;pinnata
superkingdom:Eukaryota phylum:Arthropoda class:Insecta order:Diptera family:Culicidae genus:Culex species;hayashii
superkingdom:Eukaryota phylum:Arthropoda class:Branchiopoda order:Diplostraca family:Daphniidae genus:Daphnia species;ambigua
superkingdom:Eukaryota phylum:Arthropoda class:Branchiopoda order:Diplostraca family:Daphniidae genus:Daphnia species;ambigua
superkingdom:Eukaryota phylum:Arthropoda class:Branchiopoda order:Diplostraca family:Daphniidae genus:Daphnia species;carinata

Looping over the columns (separated by " "), the first two columns match across all rows, but the 3rd column (class) does not, so the loop would stop there and print only the first two fields, e.g.

superkingdom:Eukaryota phylum:Arthropoda
superkingdom:Eukaryota phylum:Arthropoda
superkingdom:Eukaryota phylum:Arthropoda
superkingdom:Eukaryota phylum:Arthropoda
superkingdom:Eukaryota phylum:Arthropoda
superkingdom:Eukaryota phylum:Arthropoda
superkingdom:Eukaryota phylum:Arthropoda
superkingdom:Eukaryota phylum:Arthropoda

Basically, I'd like to keep/print columns that have identical values, and not keep/print columns that have multiple values.

The script would start in column/field 1 and test if all values are the same (comparing strings): if yes (as is the case in example data), then move on to column 2. Test if all values are the same in column 2 (they are), so move on to column 3. Test if all values are the same in column 3 (they are not). So, stop loop/break, and only print previous columns that had identical values.

Not sure what code to start with.

The idea is to loop over the fields in the file and print columns up to where there is a mismatch, determined by testing whether the number of unique values is greater than 1:

for ... do cut -f"$i" -d " " | sort -u>tmpf; if [ $(wc -l < tmpf) = "1" ]; then awk '{printf "%s ;", $0}' tmpf; else break; fi; done


LP_640
user95146
  • add the code you want us to help you with. – Ed Morton Oct 04 '21 at 13:54
  • You can do it in pretty much any language you are comfortable with. `awk` is an option, but not a necessity. But I'm not sure I understand your question correctly: if, say, the first field in the last line of the file is different from the one-but-last, does it mean that you want to print empty lines only? Perhaps you can, without focusing on a certain implementation, sketch the algorithm you have in mind. – user1934428 Oct 04 '21 at 14:08
  • Tried to explain better above, and added some of the potential code to use, e.g. starting from field/column 1, sort the column for unique values, and if there is more than one unique value (wc -l < $(sort -u) ) then break the loop. – user95146 Oct 04 '21 at 14:56
  • It's too bad you waited until after the question was closed to add the missing code; now we all just have to wait to see if you get enough votes to reopen it again before anyone can answer it. – Ed Morton Oct 04 '21 at 15:12
  • I could close question and ask again? How many votes will it take to reopen? – LP_640 Oct 04 '21 at 15:32
  • @EdMorton, I have voted to reopen now on this one. – RavinderSingh13 Oct 04 '21 at 19:56
  • @RavinderSingh13 the OP already opened a new question about it. – Ed Morton Oct 04 '21 at 20:04
  • @EdMorton, thanks for letting me know. Since the last comments were about reopen votes, I had cast my vote. If it has been taken care of in a new question, then this one should probably be deleted by the OP, but that is up to the OP. Cheers. – RavinderSingh13 Oct 04 '21 at 20:05
  • `loop over each column` Transpose the file and then loop over lines, then transpose again. – KamilCuk Oct 06 '21 at 01:48

2 Answers


Whenever you want to work with columns, first transpose the file, then work with lines. From "An efficient way to transpose a file in Bash":

transpose() {
    awk '
    { 
        for (i=1; i<=NF; i++)  {
            a[NR,i] = $i
        }
    }
    NF>p { p = NF }
    END {    
        for(j=1; j<=p; j++) {
            str=a[1,j]
            for(i=2; i<=NR; i++){
                str=str" "a[i,j];
            }
            print str
        }
    }'
}

then:

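transpose < input | awk '
   # Each transposed line holds one original column; flag it if any two neighbours differ.
   { for (i=1; i<NF; ++i) if ($i != $(i+1)) { stop=1; break } }
   # On the first mismatch stop reading; exit still runs the END block.
   stop { exit }
   # Remember the line if not stopped.
   { lines[linescnt++] = $0 }
   # Print the saved lines in input order ("for (i in lines)" has no guaranteed order).
   # This also covers the case where every column matches.
   END { for (i=0; i<linescnt; ++i) print lines[i] }
' | transpose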
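
To make the transpose step concrete, here is a small sketch on a made-up three-line sample (the abbreviated `k:E p:A c:I` values are invented for illustration and are not from the question's data). The function is repeated only so the snippet runs on its own. Each output line is one original column, so "are all values in this column equal?" becomes "are all fields on this line equal?".

```shell
# Same transpose idea as above, repeated so the snippet is self-contained.
transpose() {
    awk '
    { for (i = 1; i <= NF; i++) a[NR, i] = $i }   # store every cell by (row, column)
    NF > p { p = NF }                             # remember the widest row
    END {
        for (j = 1; j <= p; j++) {                # one output line per input column
            str = a[1, j]
            for (i = 2; i <= NR; i++) str = str " " a[i, j]
            print str
        }
    }'
}

# Made-up sample: columns 1 and 2 are constant, column 3 is not.
printf '%s\n' 'k:E p:A c:I' 'k:E p:A c:B' 'k:E p:A c:I' | transpose
# k:E k:E k:E
# p:A p:A p:A
# c:I c:B c:I
```

Note that this approach holds the whole file in memory twice over the pipeline, which is fine for small inputs but may matter for very large ones.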
KamilCuk
0
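data=$(< input_file)
# Number of fields, taken from the first line.
nf=$(awk '{print NF; exit}' <<<"$data")
for ((i=1; i<=nf; i++))
do
   # More than one distinct value: column i is the first mismatch.
   if [ $(cut -d" " -f "$i" <<<"$data" | sort -u | wc -l) -ne 1 ]
   then
     break
   fi
done
# Print the i-1 leading columns that matched (all of them if the loop ran to the end).
awk -v max="$((i-1))" '{for(i=1; i<=max; i++) printf "%s%s", (i>1 ? " " : ""), $i; print ""}' <<<"$data"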

superkingdom:Eukaryota phylum:Arthropoda
superkingdom:Eukaryota phylum:Arthropoda
superkingdom:Eukaryota phylum:Arthropoda
superkingdom:Eukaryota phylum:Arthropoda
superkingdom:Eukaryota phylum:Arthropoda
superkingdom:Eukaryota phylum:Arthropoda
superkingdom:Eukaryota phylum:Arthropoda
superkingdom:Eukaryota phylum:Arthropoda
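
The same result can probably also be had in awk alone, without running `cut` and `sort` once per column. The following is only a sketch under a few assumptions: `common_prefix_cols` is a hypothetical helper name, the input is a regular file (it is read twice, so a pipe will not work), and fields are separated by whitespace. The first pass shrinks `keep` to the number of leading columns that match the first line everywhere; the second pass prints just those columns.

```shell
# common_prefix_cols FILE (hypothetical helper): print the leading columns
# whose value is identical on every line of FILE.
common_prefix_cols() {
    awk '
    # Pass 1 (NR == FNR only while reading the first copy of the file).
    NR == FNR {
        if (FNR == 1) {
            keep = NF
            for (i = 1; i <= NF; i++) first[i] = $i
            next
        }
        # Appending "" forces a string comparison, so "1.0" and "1" count as different.
        for (i = 1; i <= keep; i++)
            if (($i "") != (first[i] "")) { keep = i - 1; break }
        next
    }
    # Pass 2: print only the first "keep" fields of each line.
    {
        line = ""
        for (i = 1; i <= keep; i++) line = line (i > 1 ? " " : "") $i
        print line
    }' "$1" "$1"
}

# Usage on a made-up sample (values invented for illustration):
tmp=$(mktemp)
printf '%s\n' 'k:E p:A c:I o:D' 'k:E p:A c:B o:A' 'k:E p:A c:I o:D' > "$tmp"
common_prefix_cols "$tmp"
# k:E p:A
# k:E p:A
# k:E p:A
rm -f "$tmp"
```

If the very first column already differs, `keep` drops to 0 and the sketch prints one empty line per input line.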
ufopilot