How to sort a text file by length of label

Question

I have a text file with 559 lines, and I need to sort the lines in one part of the file by their labels, from the longest label to the shortest. I was thinking of using sort, but the lines have no real delimiter, and I am not sure how to specify the start and end of the key for the -k flag.

Here is an example of my text file:

^(.*a)$0UMYBPEB(.*)$1$|\0ybpeb\1
^(.*a)$0UMYBPUK(.*)$1$|\0yuk  \1
^(.*a)$0UMYBPUKE(.*)$1$|\0yuke \1
^(.*a)$0USAAHPERD(.*)$1$|\0aahpe\1
^(.*a)$0USAASC(.*)$1$|\0aasc \1
^(.*a)$0USAATF(.*)$1$|\0aatf \1
^(.*a)$0USABARIS(.*)$1$|\0abar \1
^(.*a)$0USABOR(.*)$1$|\0abor \1
^(.*a)$0USACA(.*)$1$|\0aca  \1
^(.*a)$0USACI(.*)$1$|\0aci  \1
^(.*a)$0USACMLA(.*)$1$|\0acmla\1
^(.*a)$0USACSANZ(.*)$1$|\0acsan\1
^(.*a)$0USACTA(.*)$1$|\0acta \1
^(.*a)$0USACTACLASS(.*)$1$|\0cass \1
^(.*a)$0USAD(.*)$1$|\0adbus\1
^(.*a)$0USADAMMATTHEW(.*)$1$|\0adam \1
^(.*a)$0USAEA(.*)$1$|\0aea  \1
^(.*a)$0USAFAS(.*)$1$|\0afas \1
^(.*a)$0USAFRICAN(.*)$1$|\0afric\1
^(.*a)$0USAGI(.*)$1$|\0agi  \1
^(.*a)$0USAGO(.*)$1$|\0ago  \1

Note that the labels I am referring to come after the first $ and before the (.*).

The result I want lists the lines from the longest label to the shortest:

^(.*a)$0USADAMMATTHEW(.*)$1$|\0adam \1
^(.*a)$0USACTACLASS(.*)$1$|\0cass \1
^(.*a)$0USAFRICAN(.*)$1$|\0afric\1
^(.*a)$0USAAHPERD(.*)$1$|\0aahpe\1
^(.*a)$0USACSANZ(.*)$1$|\0acsan\1
^(.*a)$0UMYBPUKE(.*)$1$|\0yuke \1
^(.*a)$0USABARIS(.*)$1$|\0abar \1
^(.*a)$0USACMLA(.*)$1$|\0acmla\1
^(.*a)$0UMYBPEB(.*)$1$|\0ybpeb\1
^(.*a)$0UMYBPUK(.*)$1$|\0yuk  \1
^(.*a)$0USAFAS(.*)$1$|\0afas \1
^(.*a)$0USAASC(.*)$1$|\0aasc \1
^(.*a)$0USAATF(.*)$1$|\0aatf \1
^(.*a)$0USABOR(.*)$1$|\0abor \1
^(.*a)$0USACTA(.*)$1$|\0acta \1
^(.*a)$0USACA(.*)$1$|\0aca  \1
^(.*a)$0USACI(.*)$1$|\0aci  \1
^(.*a)$0USAEA(.*)$1$|\0aea  \1
^(.*a)$0USAGI(.*)$1$|\0agi  \1
^(.*a)$0USAGO(.*)$1$|\0ago  \1
^(.*a)$0USAD(.*)$1$|\0adbus\1

The copy-paste did not work for me, but I hope you get the general idea. – user7011225, May 16 '17 at 18:44
If I could sort the lines from longest to shortest, I would be happy. – user7011225, May 16 '17 at 18:51
See: [Sort a text file by line length including spaces](http://stackoverflow.com/q/5917576/3776858) – Cyrus, May 16 '17 at 19:13

score 2 · Answer 1 · answered May 16 '17 at 18:59

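You can use Perl like so:

perl -ne 'push @Lines,$_;}{print (sort { length($b) <=> length($a) } @Lines)' file

Each line is read and pushed onto the array @Lines.

}{ is not special syntax. The -n switch wraps the code in a while (<>) { ... } loop, so }{ closes that loop and opens a new block that runs once, after the last line has been read.

sort { length($b) <=> length($a) } @Lines sorts the array by descending length; $a and $b are the special variables that hold the two elements being compared.

print prints the sorted array.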
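For readers who find }{ cryptic, the same one-liner can be written with an explicit END block. This is a minimal sketch rather than the answerer's code: the function name sort_by_length_perl and the sample data are made up for illustration.

```shell
# Hypothetical helper: print the lines of a file, longest first.
# END { ... } runs once after the implicit -n loop has read every line,
# which is the effect the }{ trick achieves.
sort_by_length_perl() {
    perl -ne 'push @lines, $_; END { print sort { length($b) <=> length($a) } @lines }' "$1"
}

# Small made-up sample with distinct line lengths.
sample=$(mktemp)
printf '%s\n' 'bb' 'a' 'cccc' 'ddd' > "$sample"

sort_by_length_perl "$sample"
rm -f "$sample"
```

With this sample the lines come out as cccc, ddd, bb, a.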

score 2 · Answer 2 · answered May 16 '17 at 19:53

awk (and friends) to the rescue

awk '{print length($0) "\t" $0}' file | sort -nr | cut -f2-

^(.*a)$0USADAMMATTHEW(.*)$1$|\0adam \1
^(.*a)$0USACTACLASS(.*)$1$|\0cass \1
^(.*a)$0USAFRICAN(.*)$1$|\0afric\1
^(.*a)$0USAAHPERD(.*)$1$|\0aahpe\1
^(.*a)$0USACSANZ(.*)$1$|\0acsan\1
^(.*a)$0USABARIS(.*)$1$|\0abar \1
^(.*a)$0UMYBPUKE(.*)$1$|\0yuke \1
^(.*a)$0USACMLA(.*)$1$|\0acmla\1
^(.*a)$0UMYBPUK(.*)$1$|\0yuk  \1
^(.*a)$0UMYBPEB(.*)$1$|\0ybpeb\1
^(.*a)$0USAFAS(.*)$1$|\0afas \1
^(.*a)$0USACTA(.*)$1$|\0acta \1
^(.*a)$0USABOR(.*)$1$|\0abor \1
^(.*a)$0USAATF(.*)$1$|\0aatf \1
^(.*a)$0USAASC(.*)$1$|\0aasc \1
^(.*a)$0USAGO(.*)$1$|\0ago  \1
^(.*a)$0USAGI(.*)$1$|\0agi  \1
^(.*a)$0USAEA(.*)$1$|\0aea  \1
^(.*a)$0USACI(.*)$1$|\0aci  \1
^(.*a)$0USACA(.*)$1$|\0aca  \1
^(.*a)$0USAD(.*)$1$|\0adbus\1
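The pipeline in this answer is a decorate-sort-undecorate pattern: prefix each line with a sort key, sort on the key, then cut the key off again. The same pattern can key on the label alone instead of the whole line. The sketch below assumes the label is the text between the first $0 and the following (.*), as in the question's sample; the function name and temporary file are illustrative. The quoted 'EOF' delimiter keeps the shell from expanding $0 and $1 while the sample is written.

```shell
# Hypothetical helper: sort lines by the length of the label between
# the first "$0" and the next "(.*)", longest label first.
sort_by_label_length() {
    awk '{
        start = index($0, "$0")            # position of the literal "$0"
        rest  = substr($0, start + 2)      # text after "$0"
        stop  = index(rest, "(.*)")        # where the label ends
        label = substr(rest, 1, stop - 1)
        print length(label) "\t" $0        # decorate: key, tab, original line
    }' "$1" |
    sort -s -k1,1nr |                      # numeric, descending; -s (GNU/BSD) keeps input order for ties
    cut -f2-                               # undecorate: drop the key
}

sample=$(mktemp)
cat > "$sample" <<'EOF'
^(.*a)$0USACA(.*)$1$|\0aca  \1
^(.*a)$0USADAMMATTHEW(.*)$1$|\0adam \1
^(.*a)$0USAD(.*)$1$|\0adbus\1
^(.*a)$0USAFRICAN(.*)$1$|\0afric\1
EOF

sort_by_label_length "$sample"
rm -f "$sample"
```

With this sample the labels come out in the order USADAMMATTHEW, USAFRICAN, USACA, USAD. Because the text around the label has a fixed width in the sample, sorting by whole-line length gives the same grouping; keying on the label only matters when the rest of the line varies.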

RomanPerekhrest · Answer 3 · 2017-05-17T13:42:51.980

With single gawk (GNU awk):

awk '{a[length,NR]=$0}END{n=asorti(a,dest); for(;n>0;n--) print a[dest[n]]}' file

The output:

^(.*a)$0USADAMMATTHEW(.*)$1$|\0adam \1
^(.*a)$0USACTACLASS(.*)$1$|\0cass \1
^(.*a)$0USAAHPERD(.*)$1$|\0aahpe\1
^(.*a)$0USAFRICAN(.*)$1$|\0afric\1
^(.*a)$0USABARIS(.*)$1$|\0abar \1
^(.*a)$0UMYBPUKE(.*)$1$|\0yuke \1
^(.*a)$0USACSANZ(.*)$1$|\0acsan\1
^(.*a)$0UMYBPUK(.*)$1$|\0yuk  \1
^(.*a)$0USACMLA(.*)$1$|\0acmla\1
^(.*a)$0UMYBPEB(.*)$1$|\0ybpeb\1
^(.*a)$0USABOR(.*)$1$|\0abor \1
^(.*a)$0USAATF(.*)$1$|\0aatf \1
^(.*a)$0USAASC(.*)$1$|\0aasc \1
^(.*a)$0USAFAS(.*)$1$|\0afas \1
^(.*a)$0USACTA(.*)$1$|\0acta \1
^(.*a)$0USACA(.*)$1$|\0aca  \1
^(.*a)$0USAGO(.*)$1$|\0ago  \1
^(.*a)$0USAGI(.*)$1$|\0agi  \1
^(.*a)$0USAEA(.*)$1$|\0aea  \1
^(.*a)$0USACI(.*)$1$|\0aci  \1
^(.*a)$0USAD(.*)$1$|\0adbus\1

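length - the length of the current line (with no argument, length is shorthand for length($0))

asorti(source [, dest [, how ] ]) - sorts the indices of source (in ascending string order by default) and returns the number of elements

dest - the array that receives the sorted indices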
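One caveat: asorti compares the indices as strings, so the one-liner orders lines correctly only while all line lengths have the same number of digits (a length of 9 would otherwise sort after 10). Zero-padding the key makes string order agree with numeric order. The sketch below shows the idea with POSIX awk and sort rather than gawk; the function name, the pad width of 5, and the sample data are assumptions for illustration.

```shell
# Hypothetical helper: longest line first, using a zero-padded length as a
# string sort key so that "00009" sorts before "00010".
sort_by_padded_length() {
    awk '{ printf "%05d\t%s\n", length($0), $0 }' "$1" |
    LC_ALL=C sort -r |     # plain byte-wise string comparison, descending
    cut -f2-               # drop the padded key
}

# Made-up sample with line lengths 9, 10 and 2.
sample=$(mktemp)
printf '%s\n' 'aaaaaaaaa' 'bbbbbbbbbb' 'cc' > "$sample"

sort_by_padded_length "$sample"
rm -f "$sample"
```

With this sample the 10-character line comes first, then the 9-character line, then cc. In the gawk one-liner the same padding could presumably be applied by building the index from sprintf("%05d", length) instead of length.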

James Brown · Answer 4 · 2017-05-16T21:24:56.680

Here's another one in GNU awk:

$ gawk '
function cmp_val_len(i1,v1,i2,v2) {   # define length comparing function for asort
    return(length(v2) - length(v1)) 
}
{
    a[NR]=$0                          # hash records to a
}
END {
    n=asort(a,b,"cmp_val_len")        # sort the records using defined function
    for(i=1;i<=n;i++)                 # loop and
        print b[i]                    # output
}
' file

Output (only the start):

^(.*a)$0USADAMMATTHEW(.*)$1$|\0adam \1
^(.*a)$0USACTACLASS(.*)$1$|\0cass \1
^(.*a)$0USAAHPERD(.*)$1$|\0aahpe\1
^(.*a)$0USAFRICAN(.*)$1$|\0afric\1
^(.*a)$0UMYBPUKE(.*)$1$|\0yuke \1
....
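The comparison function returns 0 for lines of equal length, so the relative order of ties is left to gawk. A possible variant, sketched here under the assumption that ties should keep their input order, falls back to comparing the indices (the record numbers). It needs a gawk version that accepts a comparison function as the third argument to asort; the function names and sample data are made up for illustration.

```shell
# Hypothetical helper: longest line first, ties in input order.
sort_by_length_stable() {
    gawk '
    function cmp_len_then_index(i1, v1, i2, v2) {
        if (length(v1) != length(v2))
            return length(v2) - length(v1)    # longer value sorts first
        return i1 - i2                        # same length: lower record number first
    }
    { a[NR] = $0 }                            # store each record under its line number
    END {
        n = asort(a, b, "cmp_len_then_index") # b receives the values in sorted order
        for (i = 1; i <= n; i++)
            print b[i]
    }' "$1"
}

# Run the demo only where gawk is installed.
if command -v gawk >/dev/null 2>&1; then
    sample=$(mktemp)
    printf '%s\n' 'bb' 'a' 'cc' 'dddd' > "$sample"
    sort_by_length_stable "$sample"
    rm -f "$sample"
fi
```

With this sample the output is dddd, bb, cc, a: the two 2-character lines stay in the order in which they were read.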

Nice answer, completely forgot you can define the sort function for asort. – 123 May 16 '17 at 21:30
@123 Me too, took me 15 minutes to remember how it worked again. :D – James Brown May 16 '17 at 21:31