
I have a (previously sorted) text file in which every position holds either a dash - or a single alphabetical character. I'd greatly appreciate any help in better understanding the proper awk syntax to move through each column of the text file and retain only the first non-dash character in that column (reading down the rows) if a non-dash character exists, or else to retain the dash if no alphabetical character exists. The result in either situation would be a single row of text. Files are always formatted in such a way that every row has the same number of columns, and the first non-dash character is always preferred, regardless of whether other alphabetical characters exist in 'lower' rows.

Two examples to clarify: given this text file:

# printf 't---k-\ncha---\n--nn--\n--ab-s\n'

t---k-
cha---
--nn--
--ab-s

The program would start in the first column and, because the first character there is not a dash, retain a t. We'd then proceed to the second column, where the first row holds a dash, and so advance to the second row, where an h is selected. The third column likewise yields an a from the second row, while in the fourth column you'd have to move down to the third row to select the n character, etc. The expected string to report is:

thanks

In the second example, we have a very similar arrangement of text, with one exception:

# printf 't-----\ncha---\n--nn--\n--ab-s\n'

t-----
cha---
--nn--
--ab-s

Notice there is no alphabetical character present in the fifth column in this second example. Because no such character exists, we would return a dash in that position. Thus the expected output would be:

than-s

This post highlights a pandas approach somewhat similar to what I'm trying to achieve, and this post similarly offers a solution via numpy, but I believe they both require functions applicable for integers, whereas I have a data set consisting of alphabetical characters. This post similarly explains a method to apply a function in column-wise fashion using awk, which is closer to what I'm after, as does this other awk post. It seems to me that the awk method I'm after will similarly require me to declare a column-wise approach, which I think is stated in the beginning of the function as:

awk '{for (i=1;i<=NF;i++){

... where I'm stuck is trying to identify the next argument of the function, where I think I'm after some type of if/else statement. That's the part where I'm hoping to get further clarification.

Perhaps the solution need not be done via awk. I'm certainly open to other strategies in any language, so if Python, Perl, or some other language is clearly more appropriate, thank you for the education.

Thanks for your consideration

Devon O'Rourke

2 Answers


Using any awk in any shell on every Unix box:

$ cat tst.awk
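{
    numChars = length($0)
    for (i=1; i<=numChars; i++) {
        # Fill this column only while it is still unset or holds a dash,
        # so the first non-dash character seen in the column is kept.
        if ( chars[i] ~ /^-?$/ ) {
            chars[i] = substr($0,i,1)
        }
    }
}
END {
    # Print the collected characters in column order, ending with ORS.
    for (i=1; i<=numChars; i++) {
        printf "%s%s", chars[i], (i<numChars ? "" : ORS)
    }
}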

$ awk -f tst.awk file1
thanks

$ awk -f tst.awk file2
than-s
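
As a minimal sketch for reproducing the two runs above without creating files first, the same logic can be wrapped in a shell function and fed from the `printf` commands in the question. The function name `first_nondash` is hypothetical and only used for illustration; the awk body follows the script above.

```shell
# Hypothetical wrapper (the name first_nondash is an assumption, not from
# the answer). It reads files given as arguments, or stdin when none are given.
first_nondash() {
    awk '
    {
        numChars = length($0)
        for (i = 1; i <= numChars; i++) {
            # Keep the first non-dash character seen in each column.
            if (chars[i] ~ /^-?$/) {
                chars[i] = substr($0, i, 1)
            }
        }
    }
    END {
        for (i = 1; i <= numChars; i++) {
            printf "%s%s", chars[i], (i < numChars ? "" : ORS)
        }
    }' "$@"
}

# The two sample inputs from the question.
printf 't---k-\ncha---\n--nn--\n--ab-s\n' | first_nondash
printf 't-----\ncha---\n--nn--\n--ab-s\n' | first_nondash
```

The first call should print `thanks` and the second `than-s`, matching the output shown above.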
Ed Morton
  • This is very nice, as it will work in any awk version; it is essentially the same algorithm as @j1-lee's. – anubhava Jun 02 '21 at 07:11
  • Yup, it's just POSIX compliant with a couple of other fixes, e.g. won't fail if the first char is `0`, will produce a valid POSIX text file as output, and isn't doing `printf `. – Ed Morton Jun 02 '21 at 11:50

You may use this gnu-awk solution:

awk '
BEGIN{FS=""}
{
   for (i=1; i<=NF; ++i)
      a[i]=a[i] $i
}
END {
   s = ""
   for (i=1; i in a; ++i)
      s = s gensub(/^((-)+|-*([^-]).*)$/, "\\2\\3", "1", a[i])
   print s
}' file
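
For readers without gawk, the same per-column idea (collect each column into a string, then pick its first non-dash character) can be sketched with POSIX features only. This is a swapped-in variant, not the answer's code: it assumes `substr()` in place of `FS=""` and `match()` in place of `gensub()`, and the names `first_nondash_cols`, `col`, and `out` are hypothetical.

```shell
# Hypothetical POSIX-awk variant of the column-string approach above.
first_nondash_cols() {
    awk '
    {
        n = length($0)
        # Append this row to each column string, character by character.
        for (i = 1; i <= n; i++)
            col[i] = col[i] substr($0, i, 1)
    }
    END {
        out = ""
        for (i = 1; i <= n; i++) {
            # match() sets RSTART to the position of the first non-dash.
            if (match(col[i], /[^-]/))
                out = out substr(col[i], RSTART, 1)
            else
                out = out "-"
        }
        print out
    }' "$@"
}

printf 't-----\ncha---\n--nn--\n--ab-s\n' | first_nondash_cols
```

For the second sample input this should print `than-s`. The counted loop in `END` plays the same role as `for (i=1; i in a; ++i)` in the gawk version: it walks the columns in index order.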
anubhava
  • Thanks Ed. In my testing it didn't shuffle the order for the 2 samples, but I think you are right and I have fixed my code. – anubhava Jun 02 '21 at 05:54
  • I have actually changed the algorithm now. – anubhava Jun 02 '21 at 06:19
  • wrt `in my testing it didn't...`, yeah, I know that's how people miss that detail, because often for small data samples the output order coincidentally ends up being what they expected, but then when you move to larger samples in the real data it's a mess. Just think about it though: what SHOULD the order be for `in`? Should it be first-in, sorted alphabetically by index, sorted numerically by index, sorted alphabetically by value, numerically by value, or something else? The answer is that there is no universally correct order, and so the order is left up to the implementation, usually hash order. – Ed Morton Jun 02 '21 at 11:42
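
The point about `in` ordering in the last comment can be illustrated with a small sketch. The function name `show_order` and the sample letters are assumptions made up for this illustration; the only claim is that the counted loop visits indices in numeric order, while the `for (k in a)` order depends on the awk implementation.

```shell
# Hypothetical demo: compare for-in traversal with a counted loop.
show_order() {
    awk 'BEGIN {
        n = split("t h a n k s", a, " ")
        # Order here is implementation-defined; it may or may not be 1..n.
        for (k in a)
            unordered = unordered a[k]
        # A counted loop always visits indices 1, 2, ..., n.
        for (i = 1; i <= n; i++)
            ordered = ordered a[i]
        print "for-in:  " unordered
        print "counted: " ordered
    }'
}

show_order
```

The `counted:` line is always `thanks`; the `for-in:` line may differ between awk implementations, which is why both answers iterate with a numeric index.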