I need to compare two versions of the same file. Both are tab-separated and have this form:

<filename1><tab><Marker11><tab><Marker12>...
<filename2><tab><Marker21><tab><Marker22><tab><Marker23>...

Each row has a different number of markers (between 1 and 10), and all markers come from a small set of possible values. A file looks like this:

fileX<tab>Z<tab>M<tab>A
fileB<tab>Y
fileM<tab>M<tab>C<tab>B<tab>Y

What I need is:

  1. Sort the file by rows
  2. Sort the markers in each row so that they are in alphabetical order

So for the example above, the result would be

fileB<tab>Y
fileM<tab>B<tab>C<tab>M<tab>Y
fileX<tab>A<tab>M<tab>Z

It's easy to do #1 using sort but how do I do #2?

UPDATE: This is not a duplicate of this post, since my rows have different lengths and I need each row (the entries after the filename) sorted individually, i.e. the only column that stays in place is the first one.

RomanPerekhrest
I Z
  • Possible duplicate of [Using bash to sort data horizontally](https://stackoverflow.com/questions/25062169/using-bash-to-sort-data-horizontally) – binduck Jul 13 '17 at 16:53

2 Answers

awk solution:

awk 'BEGIN{ FS=OFS="\t"; PROCINFO["sorted_in"]="@ind_str_asc" }
     { split($0,b,FS); delete b[1]; n=asort(b); r="";
         # numeric loop: with 10 markers, string-ordered indices would visit "10" before "2"
         for(i=1;i<=n;i++) r=(r!="")? r OFS b[i] : b[i]; a[$1] = r
     }
     END{ for(i in a) print i,a[i] }' file

The output:

fileB   Y
fileM   B   C   M   Y
fileX   A   M   Z

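  • PROCINFO["sorted_in"]="@ind_str_asc" - GNU awk only: makes for (i in array) loops visit the indices in ascending string order

  • split($0,b,FS); - split the line into array b by FS (field separator)

  • asort(b) - GNU awk only: sort the marker values; b is reindexed from 1 and the element count is returned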

RomanPerekhrest

All you need is:

awk '
BEGIN { FS = OFS = "\t" }
{ for (i=2;i<=NF;i++) arr[$1][$i] }
END {
    PROCINFO["sorted_in"] = "@ind_str_asc"
    for (i in arr) {
        printf "%s", i
        for (j in arr[i]) {
            # the markers are the indices of arr[i], so print j
            printf "%s%s", OFS, j
        }
        print ""
    }
}
' file

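The above uses GNU awk for true multi-dimensional arrays plus sorted_in. Because the markers are stored as array indices, a marker repeated within a row is printed only once.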
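If GNU awk is not available, the same result can probably be had with POSIX awk plus sort. The following is only a minimal sketch, not part of the answer above: it assumes filenames are unique and contain no tabs, it replaces asort with a small insertion sort over the fields, and the sample input is simply the example from the question.

```shell
# Sample rows from the question, held in a variable for illustration only.
input=$(printf 'fileX\tZ\tM\tA\nfileB\tY\nfileM\tM\tC\tB\tY')

# Step 1: insertion-sort fields 2..NF of every row (POSIX awk has no asort).
# Step 2: let sort(1) order the rows by the first tab-separated field.
sorted=$(printf '%s\n' "$input" | awk '
BEGIN { FS = OFS = "\t" }
{
    for (i = 3; i <= NF; i++) {
        v = $i
        # Appending "" forces a string comparison, even for numeric-looking markers.
        for (j = i - 1; j >= 2 && ($j "") > (v ""); j--)
            $(j + 1) = $j
        $(j + 1) = v
    }
    print
}' | LC_ALL=C sort -t "$(printf '\t')" -k1,1)

printf '%s\n' "$sorted"
```

For the sample rows this prints fileB, fileM and fileX in that order, each with its markers in alphabetical order. Unlike the array-index approach, repeated markers within a row are kept. With at most 10 markers per row, the quadratic insertion sort should not matter in practice.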

Ed Morton
  • Good answer. It would be nice if one day (in xxx years :) ) predictable iteration over sorted arrays in awk became part of POSIX. I would just recommend explicitly using `gawk` instead of `awk` (which is also a kind of advertisement ;) ). – hek2mgl Jul 13 '17 at 19:41
  • Actually it should not break anything when sorted arrays get added under the hood. Python3.7 was doing the same with the `dict` type. Code that assumes the array to be unsorted should still work. – hek2mgl Jul 13 '17 at 19:44
  • The problem with default sorted arrays is there is no order that's better than any other order (alphabetic? numeric? first in? incrementing? decrementing? etc.) so hash order is best as the default since it's most efficient. – Ed Morton Jul 13 '17 at 19:45
  • I see. The impressive performance of `awk` should definitely stay one of the major goals. – hek2mgl Jul 13 '17 at 19:49