I would like to loop over each column in a file and check whether all of its values match. If they do, move on to the next column. Once a mismatch is detected, the loop should stop and print only the columns before it. I assume I'll need to use arrays in AWK, but I wasn't sure how to get started. Here is an example of the dataset I'm working with:
superkingdom:Eukaryota phylum:Arthropoda class:Insecta order:Diptera family:Culicidae genus:Anopheles species;annularis
superkingdom:Eukaryota phylum:Arthropoda class:Insecta order:Diptera family:Culicidae genus:Anopheles species;dirus
superkingdom:Eukaryota phylum:Arthropoda class:Insecta order:Diptera family:Culicidae genus:Anopheles species;dirus
superkingdom:Eukaryota phylum:Arthropoda class:Branchiopoda order:Anostraca family:Thamnocephalidae genus:Branchinella species;pinnata
superkingdom:Eukaryota phylum:Arthropoda class:Insecta order:Diptera family:Culicidae genus:Culex species;hayashii
superkingdom:Eukaryota phylum:Arthropoda class:Branchiopoda order:Diplostraca family:Daphniidae genus:Daphnia species;ambigua
superkingdom:Eukaryota phylum:Arthropoda class:Branchiopoda order:Diplostraca family:Daphniidae genus:Daphnia species;ambigua
superkingdom:Eukaryota phylum:Arthropoda class:Branchiopoda order:Diplostraca family:Daphniidae genus:Daphnia species;carinata
Looping over the columns (separated by " "), the first two columns match across all rows, but the 3rd column (class) does not, so the loop would stop there and print only the first two fields, e.g.:
superkingdom:Eukaryota phylum:Arthropoda
superkingdom:Eukaryota phylum:Arthropoda
superkingdom:Eukaryota phylum:Arthropoda
superkingdom:Eukaryota phylum:Arthropoda
superkingdom:Eukaryota phylum:Arthropoda
superkingdom:Eukaryota phylum:Arthropoda
superkingdom:Eukaryota phylum:Arthropoda
superkingdom:Eukaryota phylum:Arthropoda
Basically, I'd like to keep/print columns that have identical values, and not keep/print columns that have multiple values.
The script would start in column/field 1 and test if all values are the same (comparing strings): if yes (as is the case in example data), then move on to column 2. Test if all values are the same in column 2 (they are), so move on to column 3. Test if all values are the same in column 3 (they are not). So, stop loop/break, and only print previous columns that had identical values.
Not sure what code to start with.
The idea is to loop over the fields in the file and print columns up to the first mismatch, determined by testing whether the number of unique values in a column is greater than 1:
for ...
do
    cut -f"$i" -d " " | sort -u > tmpf
    if [ $(wc -l < tmpf) = "1" ]; then
        awk '{printf "%s ;", $0}' tmpf
    else
        break
    fi
done
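For reference, the column-by-column check described above can probably be done in a single awk pass, without temporary files. What follows is only a minimal sketch under some assumptions: the function name `common_prefix_columns` is made up for illustration, fields are split on awk's default whitespace separator, and values are compared as strings. It remembers the first row, shrinks the count of shared leading columns whenever a later row differs, and prints that prefix once per input row at the end.

```shell
# Hypothetical helper (name is an assumption, not an existing tool):
# print only the leading columns whose value is identical on every row.
common_prefix_columns() {
    awk '
        NR == 1 {
            # Start by assuming every column of the first row is shared.
            keep = NF
            # Appending "" forces string (not numeric) comparison later on.
            for (i = 1; i <= NF; i++) first[i] = $i ""
        }
        {
            # A shorter row cannot share more columns than it has.
            if (NF < keep) keep = NF
            # Shrink keep to just before the first column that differs from row 1.
            for (i = 1; i <= keep; i++)
                if ($i != first[i]) { keep = i - 1; break }
            rows++
        }
        END {
            # No shared leading column (or empty input): print nothing.
            if (keep < 1) exit
            # Every row has the same prefix, so rebuild it once from row 1.
            prefix = first[1]
            for (i = 2; i <= keep; i++) prefix = prefix " " first[i]
            for (r = 1; r <= rows; r++) print prefix
        }
    ' "$@"
}
```

It could be called as `common_prefix_columns taxonomy.txt` (the file name is a placeholder) or fed from a pipe. Because only the first row and a counter are kept in memory, the file does not need to be read twice.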