To iterate over a list of characters that are known at the time of writing the program (in this example, the characters are "X", "Y", "Z"):

for (i = 1; i <= 3; ++i) {
    c = substr("XYZ", i, 1)
    # do something with the character
}

Question: Is there a more awk-y way of doing this? Note that this is not the same as this question, as the characters I want to iterate over are not a part of the input.

To put it in context, I need to count the occurrences of X, Y and Z at each position in a line, over all lines. The input should consist only of Xs, Ys and Zs, on lines of the same length:

$ cat input.txt
XYXXXYZZYXY
XXXYYYZYYZY
YZZZZYZZXZZ
XXZXXYYZXZY

$ foo.awk < input.txt
X 3 2 2 2 2 0 0 0 2 1 0
Y 1 1 0 1 1 4 1 1 2 0 3
Z 0 1 2 1 1 0 3 3 0 3 1

This is foo.awk at the moment:

#!/bin/awk -f
BEGIN {
    FS = ""
}
NR == 1 {
    len = NF
}
{
    for (i = 1; i <= NF; ++i)
        ++profile[$i][i]
}
END {
    for (c = 1; c <= 3; ++c) {
        char = substr("XYZ", c, 1)
        printf "%s", char
        for (i = 1; i <= len; ++i)
            printf " %d", profile[char][i]
        printf "\n"
    }
}

I have not used awk before so probably my whole approach is totally wrong.

1 Answer

Your script looks good. Here is a version that illustrates some slight variations in style:

#!/usr/bin/awk -f
BEGIN {
    FS = ""
    split("XYZ",chars,"")
}
{
    for (i = 1; i <= NF; ++i)
        ++profile[$i,i]
}
END {
    for (c=1;c in chars;c++) {
        printf "%s", chars[c]
        for (i = 1; i <= NF; ++i)
            printf " %d", profile[chars[c],i]
        printf "\n"
    }
}

The statement split("XYZ",chars,"") creates an array chars that has your letters in it. That way, the characters can be referred to by subscript.

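Your script uses true multidimensional arrays (arrays of arrays, as in profile[$i][i]), which are a GNU awk extension. In the script above, I used the standard awk method for getting the same result: profile[$i,i] joins the two subscripts into a single key. (Setting FS="" is also a GNU extension.)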
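For an awk without these extensions, the same counting can be sketched with substr() in place of FS="" and a comma subscript in place of nested arrays. This is only a rough illustration under assumptions: the shell function name profile and the awk variable names count, letters and len are made up for the example, and it assumes an awk that follows POSIX for substr(), length() and SUBSEP-joined subscripts.

```shell
# Sketch: per-column character counts using only standard awk features.
# substr() stands in for FS = "", and count[ch, i] stands in for profile[ch][i].
profile() {
    awk '
    {
        if (length($0) > len) len = length($0)   # remember the widest line seen
        for (i = 1; i <= length($0); ++i)
            ++count[substr($0, i, 1), i]          # key is: character SUBSEP column
    }
    END {
        letters = "XYZ"                           # the fixed set of characters to report
        for (c = 1; c <= length(letters); ++c) {
            ch = substr(letters, c, 1)
            printf "%s", ch
            for (i = 1; i <= len; ++i)
                printf " %d", count[ch, i]        # missing keys print as 0
            printf "\n"
        }
    }'
}

# Usage with the sample input from the question:
printf 'XYXXXYZZYXY\nXXXYYYZYYZY\nYZZZZYZZXZZ\nXXZXXYYZXZY\n' | profile
```

The comma in count[ch, i] makes awk build one string key from both subscripts, separated by the value of SUBSEP, so a single ordinary associative array is enough.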

Lastly, the outer for loop in END was changed to scan over the array indices with for (c=1;c in chars;c++) .... This has the advantage of working even if you change the number of elements in chars. Unlike for (c in chars), whose iteration order awk does not guarantee, this counted loop visits the indices in ascending order and stops at the first index that is missing from the array.
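As a small illustration of that loop on its own, the sketch below fills an array with split() and walks it with the in test. The function name ordered_chars is invented for the example, and the letters are separated by spaces here so that split() does not rely on an empty separator, whose behavior is not specified by POSIX.

```shell
# Sketch: iterate over a fixed list of characters in a guaranteed order.
ordered_chars() {
    awk 'BEGIN {
        n = split("X Y Z", chars, " ")        # chars[1]="X", chars[2]="Y", chars[3]="Z"
        for (c = 1; c in chars; c++)           # ends when index c is not in chars
            printf "%s%s", chars[c], (c < n ? " " : "\n")
    }'
}

ordered_chars
```

Adding a letter to the string passed to split() is then the only change needed; the loop bounds adjust by themselves.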

John1024
  • Thank you! As far as the `for (c in chars)`, this is what I had, before I realized that it doesn't necessarily iterate in the initial order. So I guess it must be `for (i = 1; i <= length; ++i)`. –  Apr 12 '14 at 08:36
  • And not only are multidimensional arrays an extension, they are new enough not to be in the version of gawk installed on a server I have to use.... –  Apr 12 '14 at 09:08
  • +1 FYI setting `FS = ""` is also a gawk extension, the behavior is unspecified by POSIX. The `len` variable and associated `NR==1` block isn't necessary as `NF` will be set in the `END` section to its value on the last read line so you can just loop to `NF` instead of to `len`. You also don't need the 3rd arg to `split()` since what you're using is the same as `FS`. – Ed Morton Apr 12 '14 at 12:24
  • @Boris consider `for (i = 1; i in chars; ++i)` instead of `for (i = 1; i <= length; ++i)`. `length` with no args is a gawk extension to provide the length of the current input record ($0), not the length of the array you're interested in. length(array) returning the number of entries in an array is also a gawk extension. $0 being set in the END section is ALSO a gawk extension. The `in` operator is standard awk. – Ed Morton Apr 12 '14 at 12:29
  • @EdMorton the `i <= length` in the for loop was not meant to be actual code; sorry for being unclear. However, `for (i = 1; i in chars; ++i)` is much better anyway and exactly what I need! Thank you, and for the other tips, too! –  Apr 12 '14 at 15:54
  • @EdMorton I incorporated your improvements into the answer. – John1024 Apr 12 '14 at 17:29