How to sort a file by line length and then alphabetically for the second key?

Question

Say I have a file:

ab
aa
c
aaaa

I would like it to be sorted like this

c
aa
ab
aaaa

That is to sort by line length and then alphabetically. Is that possible in bash?

We encourage questioners to show what they have tried so far to solve the problem themselves. — Cyrus, Dec 13 '20 at 14:20

Enlico · Accepted Answer · 2020-12-15T17:20:29.063

10

You can prepend the length of the line to each line, then sort numerically, and finally cut out the numbers:

< your_file awk '{ print length($0), $0; }' | sort -n | cut -d' ' -f2-
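To make the three stages visible, here is a minimal sketch run on the sample input from the question. The variable names are made up for illustration, cut is given an explicit space delimiter (awk separates the length from the line with a space, not a tab), and LC_ALL=C is an assumption added so that ties are broken in plain byte order.

```shell
# Sample input from the question, one item per line.
input=$(printf '%s\n' ab aa c aaaa)

# Stage 1: prefix each line with its length, e.g. "2 ab".
decorated=$(printf '%s\n' "$input" | awk '{ print length($0), $0 }')

# Stage 2: numeric sort on the prefix. Stage 3: strip the prefix again.
# -f2- keeps everything after the first space, so lines containing
# spaces survive intact.
sorted=$(printf '%s\n' "$decorated" | LC_ALL=C sort -n | cut -d' ' -f2-)

printf '%s\n' "$sorted"
# prints:
# c
# aa
# ab
# aaaa
```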

You see that I've accomplished the sorting via sort -n, without doing any multi-key sorting. Honestly I was lucky that this worked:

I had not considered that lines could begin with digits, so I expected sort -n to work because alphabetic and numeric sorting give the same result if all the strings are the same length, as is the case here precisely because we are sorting by the line length that I add via awk.
It turns out everything works even if your input has lines starting with digits, the reason being that sort -n
1. sorts numerically on the leading numeric part of the lines;
2. in case of ties, uses strcmp to compare the whole lines.
Here's a demo:
```
$ echo -e '3 11\n3 2' | sort -n
3 11
3 2
# the `3 ` on both lines makes them equal for numerical sorting
# but `3 11` comes before `3 2` by `strcmp` because `1` comes before `2`

$ echo -e '3 11\n03 2' | sort -n
03 2
3 11
# the `03 ` vs `3 ` is a numerical tie,
# but `03 2` comes before `3 11` by `strcmp` because `0` comes before `3`
```
So the lucky part is that the comma I included in the awk command inserts a space (actually the OFS), i.e. a non-digit, which ends the numeric prefix and lets the strcmp comparison kick in on the whole lines that compare equal numerically.
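A small sketch of that tie-breaking, assuming GNU sort (the -s option is an extension, not something the answer relies on): with -s the last-resort comparison of whole lines is skipped, so numerically tied lines keep their input order. The variable names are illustrative only.

```shell
# Two lines that tie numerically on the leading "3".
ties=$(printf '%s\n' '3 b' '3 a')

# Default: after the numeric tie, the whole lines are compared.
default_order=$(printf '%s\n' "$ties" | LC_ALL=C sort -n)

# -s (stable) disables that last-resort comparison, so the tied
# lines stay in the order they arrived in.
stable_order=$(printf '%s\n' "$ties" | LC_ALL=C sort -n -s)

printf 'default:\n%s\nstable:\n%s\n' "$default_order" "$stable_order"
# default:
# 3 a
# 3 b
# stable:
# 3 b
# 3 a
```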

Whether this behavior is POSIX or not, I don't know, but I'm using GNU coreutils 8.32's sort. Refer to this question of mine and this answer on Unix for details.

awk could do it all by itself, but I think using sort to sort is more idiomatic and more efficient, as explained in a comment (after all, why would you not expect sort to be the best-performing tool in the shell for sorting?).

edited Dec 15 '20 at 17:20

answered Dec 13 '20 at 14:30


_to sort is more idiomatic_ .... I think this is not really an argument. However, _sort_ can deal well with huge files, while with awk, everything would have to fit into memory if you want to use the built-in `sort` of awk; and if you go this far, I would not even use awk, but something like Perl or Ruby, which would be more suitable. So in the end, **this** would be for me an argument in favor of using `... | sort`. BTW, in your solution, you should put the multi-key sorting right into the code example, since the OP requested that for equal-length keys, sorting should be done alphabetically. – user1934428 Dec 14 '20 at 12:26
@user1934428, please see if you like it now. As regards Ruby and Perl, I don't know them, so I don't even know how well they perform. You could add another answer, I guess. – Enlico Dec 14 '20 at 17:00
_in case of ties, it keeps using alphabetic sorting based on the rest of the line_ : I don't think this is true. In fact, the order is unspecified, and it just happens to work with your example, but could break in the general case. To demonstrate it, I add the option `-s`, which says "keep the original order if you can't decide based on the sorting criteria provided": `(echo 3 b; echo 3 a) | sort -n -s`. Actually, I think your original idea of explicitly specifying two sort keys was better. – user1934428 Dec 15 '20 at 06:59

@user1934428, please, consider my edited answer in light of [the question I linked](https://stackoverflow.com/questions/65302655/does-sort-n-handle-ties-predictably-when-the-stable-option-is-not-provided-i/). – Enlico Dec 15 '20 at 13:47
I see! Thank you for explicitly pointing this out to me again. – user1934428 Dec 15 '20 at 14:06
@user1934428, thank you for pushing me to investigate. As I wrote in the answer I was lucky, I did not know of these details I know now. – Enlico Dec 15 '20 at 14:07

score 2 · Answer 2 · answered Dec 13 '20 at 14:31

Insert a length for the line using gawk (zero-filled to four places so the prefixes line up), sort by two keys (first the length, compared numerically, then the rest of the line), then remove the length:

gawk '{printf "%04d %s\n", length($0), $0}' | sort -k1,1n -k2 | cut -d' ' -f2-

If it must be bash:

while IFS= read -r line; do printf '%04d %s\n' "${#line}" "$line"; done | sort -k1,1n -k2 | while IFS= read -r rec; do printf '%s\n' "${rec#* }"; done
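As a sketch of the explicit two-key approach, the pipeline can be wrapped in a function and tried on input that includes lines starting with digits and a line containing a space. The name sort_by_length is hypothetical, and LC_ALL=C is an assumption so that the second key is compared byte by byte (digits before letters).

```shell
# sort_by_length is a made-up helper name, not taken from the answers.
# It reads lines on stdin and writes them sorted by length, then by content.
sort_by_length() {
    awk '{ print length($0), $0 }' |
        LC_ALL=C sort -k1,1n -k2 |   # key 1: length as a number; key 2: rest of line
        cut -d' ' -f2-               # drop the length prefix again
}

result=$(printf '%s\n' ab '9 z' aa c aaaa 10 | sort_by_length)
printf '%s\n' "$result"
# prints:
# c
# 10
# aa
# ab
# 9 z
# aaaa
```

Because the first key is limited to field 1 and compared numerically, the zero-padding of the length is not needed in this variant.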

score 1 · Answer 3 · answered Dec 13 '20 at 16:57

For GNU awk:

$ gawk '{
    a[length()][$0]++                             # hash to 2d array
}
END {
    PROCINFO["sorted_in"]="@ind_num_asc"          # first sort on length dim
    for(i in a) {
        PROCINFO["sorted_in"]="@ind_str_asc"      # and then on data dim
        for(j in a[i])
            for(k=1;k<=a[i][j];k++)               # in case there are duplicates
                print j
        # PROCINFO["sorted_in"]="@ind_num_asc"    # not needed: gawk fixes the outer loop order when that loop starts
    }
}' file

Output:

c
aa
ab
aaaa
aaaaaaaaaa
aaaaaaaaaa
