
Assume a text file that contains x number of string columns.

$ cat file    # where x=3
foo  foo  foo
bar  bar  bar
     baz  baz
     qux

Is there a way in bash to sort these columns by the number of text strings (i.e., filled rows) they contain, while maintaining the internal order of rows within each column?

$ sought_command file
foo  foo  foo
bar  bar  bar
baz  baz
qux

Essentially, the column with the most rows should come first, the column with the second-most rows second, and so on.

(This task would be easy to implement via R, but I am wondering about a solution via bash.)

EDIT 1:

Here are some additional details: Every column contains at least one text string (i.e., one filled row). The text strings may constitute any alphanumeric combination and have any length (but obviously do not contain spaces). The output columns must not have blank rows inserted. There is no a priori limitation on the column delimiter, as long as it remains consistent across the table.

All that is needed for this task is to shift the columns around as-is such that they are sorted by column length. (I know that implementing this in bash sounds easier than it actually is.)

Michael Gruenstaeudl
  • Could a column have gaps? That is, in line 5 of your file, could column 2 have no entry while columns 1 and 3 do have entries? Suppose 2 columns have the same number of entries; should the columns be kept in sequence (so if both columns 2 and 3 have 4 values, column 2 should appear before column 3)? – Jonathan Leffler Nov 28 '16 at 18:54
  • One problem is simply determining which columns are present and which are not. Awk, by default, treats any sequence of blanks as equivalent to a single blank, at least as far as separating fields. Are the columns all of a uniform width (you show 3 characters per entry in the sample data; is the real data equally well disciplined)? Can you munge the data so that there is a marker (e.g. a dash `-`) in the otherwise empty columns? That would make the job a lot easier. Or are column values tab separated? – Jonathan Leffler Nov 28 '16 at 18:57
  • @JonathanLeffler All columns contain at least one text string (i.e., at least one row). I updated my question to provide the requested details. – Michael Gruenstaeudl Nov 28 '16 at 19:09
  • bash is a shell. A shell is an environment from which to create/destroy files/processes and sequence calls to tools, it is not a tool to manipulate text as you're trying to do. So, when you say you want to do it "in bash" - what exactly do you mean? Also, are your columns fixed-width or tab-separated or what? – Ed Morton Nov 28 '16 at 19:12
  • Supposing line N has columns `abcdef x pqrstuvw` and line N+1 contains `m n`. (SO comments show multiple blanks as one.) Awk will, by default, read line N as having 3 fields (`$1`, `$2`, `$3`), and line N+1 as having 2 fields (`$1` and `$2`). How can humans know whether the `m` is in column 1 or 2, and the `n` is in column 2 or 3? Are there always 2 spaces between columns? But your sample data doesn't show that? For anything to be doable, there has to be a way of looking at each line and determining which columns are present and which are absent. How does a human tell? – Jonathan Leffler Nov 28 '16 at 19:16
  • @EdMorton You are quite right - I should be more specific. I am looking for a solution that has the speed of `awk`, `sed` or `grep`, which are often invoked via `bash` or a similar shell. – Michael Gruenstaeudl Nov 28 '16 at 19:16
  • 1
    OK then you're looking for an awk solution since sed is for simple substitutions on individual lines and grep is to print a string matching a regexp and neither of those is what you want so that leaves awk and that solution will be independent of whatever shell you use, bash or otherwise. So - what separates the "columns" in your sample input? Tabs? Or are they fixed-width? Or something else? – Ed Morton Nov 28 '16 at 19:19
  • @MichaelGruenstaeudl: Also noticed a pattern, first two rows for all the columns being the same? Is the pattern meant that way or it can be random? – Inian Nov 28 '16 at 19:27
  • @Inian Can be random. – Michael Gruenstaeudl Nov 28 '16 at 19:30

4 Answers


With GNU awk for sorted_in and assuming your columns are tab-separated:

$ cat tst.awk
BEGIN{ FS=OFS="\t" }
{
    for (i=1; i<=NF; i++) {
        if ($i ~ /[^[:space:]]/) {
            cell[NR,i] = $i
            cnt[i]++
        }
    }
    next
}
END {
    PROCINFO["sorted_in"] = "@val_num_desc"
    for (row=1; row<=NR; row++) {
        c=0
        for (col in cnt) {
            printf "%s%s", (c++?OFS:""), cell[row,col]
        }
        print ""
    }
}

$ awk -f tst.awk file
foo     foo     foo
bar     bar     bar
baz     baz
qux
Ed Morton

Create a function called transpose like this first:

transpose() {
   awk -v FPAT='[^[:blank:]]+|[ \t]{3,}' '
   {
      for (i=1; i<=NF; i++)
         a[i,NR] = $i
      max = (max < NF ? NF : max)
   }
   END {
      for (i=1; i<=max; i++)
         for (j=1; j<=NR; j++)
            printf "%s%s", a[i,j], (j==NR ? ORS : OFS)
   }'
}

Then use it as:

transpose < file | awk '{print NF "\t" $0}' | sort -k1nr | cut -f2- | transpose

foo foo foo
bar bar bar
baz baz
qux

Steps are:

  1. Call the transpose function to turn columns into rows
  2. Use awk to prepend the number of fields to each line
  3. Use sort in reverse numerical order on that first column
  4. Use cut to remove the first column again
  5. Call transpose once more to restore the original orientation

PS: Due to the use of FPAT, GNU awk is required here.
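To see the decorate-sort-undecorate core of this pipeline (steps 2-4) in isolation, here is a small sketch; the tab-separated input below is made up and simply stands in for the output of transpose:

```shell
# Steps 2-4 in isolation: prefix each row with its field count,
# sort descending on that count, then strip the count again.
# The three input rows have 2, 4, and 3 fields respectively.
printf 'foo\tbar\nfoo\tbar\tbaz\tqux\nfoo\tbar\tbaz\n' |
  awk '{print NF "\t" $0}' |
  sort -k1nr |
  cut -f2-
```

The field count prepended by awk is what sort orders on; cut then discards it before the second transpose.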

anubhava
  • It looks good in theory. In practice, it only works with the provided example because all the columns begin with `foo bar`, and you don't notice columns being merged. Replace `bar bar bar` with `bir bar ber` and you'll see. It looks like you just replace the first whitespaces by an empty string, just like my sed joke. – Eric Duminil Nov 28 '16 at 19:23
  • @anubhava Excellent explanations! – Michael Gruenstaeudl Nov 28 '16 at 19:37

With the standard Unix toolset:

$ tr '\t' '\n' <file                  | 
  pr -4ts                             |  
  awk '{print gsub(/-/,"-") "\t" $0}' | 
  sort -k1n                           | 
  cut -f2-                            | 
  tr '\t' '\n'                        | 
  pr -3ts

foo     foo     foo
bar     bar     bar
baz     baz     -
qux     -       -

This assumes the columns are tab-separated and that missing values are represented with "-". The magic numbers 4 and 3 are the numbers of rows and columns, respectively.

Used this as the input file

$ cat file
foo     foo     foo
bar     bar     bar
-       baz     baz
-       qux     -
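As a side note, the magic numbers need not be hard-coded; a small sketch that derives them from the data itself (assuming tab-separated input; the temp-file name is arbitrary):

```shell
# Sketch: compute the row and column counts instead of hard-coding
# 4 and 3. The sample file mirrors the marked-up input above.
printf 'foo\tfoo\tfoo\nbar\tbar\tbar\n-\tbaz\tbaz\n-\tqux\t-\n' > /tmp/demo_file
rows=$(awk 'END { print NR }' /tmp/demo_file)       # number of rows
cols=$(awk 'NR == 1 { print NF }' /tmp/demo_file)   # fields in first row
echo "$rows $cols"    # 4 3
rm -f /tmp/demo_file
```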
karakfa
sed -e 's/^ *//' columns.txt
# =>
# foo  foo  foo
# bar  bar  bar
# baz  baz
# qux

I'll be here all week! :D

On a more serious note, you might want to transpose your columns first, with either awk or rs. That would make it much easier to sort your columns (now rows) and transpose them back again. Multiple spaces could pose a problem for awk, though.
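A minimal transpose sketch along those lines, in plain awk (the sample input is hypothetical; whitespace splitting cannot represent empty cells, which is exactly the caveat above):

```shell
# A throwaway transpose: cell[i,j] holds field i of row j, then the
# END block prints column-by-column. Default field splitting means
# gaps in a row silently shift everything left.
printf 'a b c\nd e f\n' | awk '
{
    for (i = 1; i <= NF; i++) cell[i, NR] = $i
    if (NF > max) max = NF
}
END {
    for (i = 1; i <= max; i++)
        for (j = 1; j <= NR; j++)
            printf "%s%s", cell[i, j], (j == NR ? "\n" : " ")
}'
```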

Eric Duminil
  • It's just a trick using the fact that all the columns begin with the same values. Simply removing the whitespace at the beginning of each line gives the impression of having sorted the columns by height. – Eric Duminil Nov 28 '16 at 19:32