I'm trying to rearrange file1, which has been sorted by the last column as shown below:

MEL P 20190731 0453 30.599
PUS P 20190731 0453 30.612
MEA P 20190731 0453 30.620
KDT P 20190731 0453 30.639
PAS P 20190731 0453 30.644
BDT P 20190731 0453 30.900
LAB P 20190731 0453 31.046
KLS P 20190731 0453 31.129
MEL S 20190731 0453 31.222
KDT S 20190731 0453 31.249
PAS S 20190731 0453 31.255
MEA S 20190731 0453 31.258
GRA P 20190731 0453 31.263
BDT S 20190731 0453 31.551
LAB S 20190731 0453 31.630
GRA S 20190731 0453 31.816

into the output below, where lines with the same string in the first column are grouped next to each other:

MEL P 20190731 0453 30.599
MEL S 20190731 0453 31.222
PUS P 20190731 0453 30.612
MEA P 20190731 0453 30.620
MEA S 20190731 0453 31.258
KDT P 20190731 0453 30.639
KDT S 20190731 0453 31.249
PAS P 20190731 0453 30.644
PAS S 20190731 0453 31.255
BDT P 20190731 0453 30.900
BDT S 20190731 0453 31.551
LAB P 20190731 0453 31.046
LAB S 20190731 0453 31.630
KLS P 20190731 0453 31.129
GRA P 20190731 0453 31.263
GRA S 20190731 0453 31.816

while still respecting the order of the last column (notice, for instance, that the MEL lines are now next to each other and that the position of PUS relative to the others is unchanged).

I have tried this code to produce a key

awk '!array[$1]++ {print $1}' file1 > key

I then tried to match it against file1 to reorder the lines using

grep -Fwf key file1 > output

but nothing changes. Please help!

Inian
dex10
  • I don't understand this criterion, "where each line containing the same character in the first column are grouped next to each other along lines"; wouldn't that imply that all lines starting with `M` should be grouped together? – Benjamin W. May 16 '20 at 19:35
  • you've stated '*same character in the first column are grouped next to each other*' but this isn't what the desired output is showing, e.g., for the first letter 'M' ... why aren't the 'MEL' and 'MEA' rows (all start with letter 'M') '*grouped next to each other*'? same question about the 'PUS' and 'PAS' rows – markp-fuso May 16 '20 at 19:35
  • what I meant was I want that MEL is next to MEL, MEA to MEA, etc. but then the order as sorted in file1 is still respected (where MEL goes first, then PUS, then MEA, then KDT, and so on). Is that clearer? – dex10 May 16 '20 at 19:39
  • Would simple `sort -k1 -k5n file` do the trick? – James Brown May 16 '20 at 19:50
  • @JamesBrown nope, it rearranges everything. Now MEA is on top, because I believe it's primarily sorted by first column. – dex10 May 16 '20 at 20:08

4 Answers

2

In awk:

$ awk '{
    if(!($1 in a))           # enumerate all unique $1 for looping in END
        n[++c]=$1
    a[$1]=a[$1] $0 ORS       # append records to hash keyed on $1
}
END {                        # after processing records
    for(i=1;i<=c;i++)        # loop 
        printf "%s",a[n[i]]  # and output
}' file

Output:

MEL P 20190731 0453 30.599
MEL S 20190731 0453 31.222
PUS P 20190731 0453 30.612
MEA P 20190731 0453 30.620
MEA S 20190731 0453 31.258
KDT P 20190731 0453 30.639
KDT S 20190731 0453 31.249
PAS P 20190731 0453 30.644
PAS S 20190731 0453 31.255
BDT P 20190731 0453 30.900
BDT S 20190731 0453 31.551
LAB P 20190731 0453 31.046
LAB S 20190731 0453 31.630
KLS P 20190731 0453 31.129
GRA P 20190731 0453 31.263
GRA S 20190731 0453 31.816

It expects the data to be sorted on the last field.
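
If the input is not already ordered by the last field, one option is to sort it numerically on that field first and pipe the result into the same awk program. This is a minimal sketch, assuming whitespace-separated fields with the sort key in column 5. The function name `group_by_first` is made up for illustration, and `LC_ALL=C` is set so the decimal point is read the same way in any locale.

```shell
# Hypothetical helper: sort on field 5, then group lines by field 1
# in order of first appearance. Reads the named files, or stdin if none.
group_by_first() {
    LC_ALL=C sort -k5,5n "$@" | awk '
        !($1 in a) { n[++c] = $1 }      # remember each new key in arrival order
        { a[$1] = a[$1] $0 ORS }        # append the record to its group
        END { for (i = 1; i <= c; i++) printf "%s", a[n[i]] }'
}

# Example with unsorted input on stdin (shortened, made-up records)
printf '%s\n' 'MEL S x y 31.222' 'PUS P x y 30.612' 'MEL P x y 30.599' | group_by_first
# MEL P x y 30.599
# MEL S x y 31.222
# PUS P x y 30.612
```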

James Brown
  • YES this is it! I get the idea from your explanation, thank you, but what does the line `a[$1]=a[$1] $0 ORS` mean exactly? Sorry, I don't understand awk really well. – dex10 May 16 '20 at 20:24
  • It catenates the current record (line) `$0` and a newline (`ORS`) to the end of the related hash variable `a[$1]`. – James Brown May 16 '20 at 20:27
  • @EdMorton nice observation, thank you for the recommendation! – dex10 May 16 '20 at 20:48
2

With GNU sort for -s:

$ awk '!($1 in a){a[$1]=NR} {print a[$1], $0}' file | sort -s -k1,1n | cut -d' ' -f2-
MEL P 20190731 0453 30.599
MEL S 20190731 0453 31.222
PUS P 20190731 0453 30.612
MEA P 20190731 0453 30.620
MEA S 20190731 0453 31.258
KDT P 20190731 0453 30.639
KDT S 20190731 0453 31.249
PAS P 20190731 0453 30.644
PAS S 20190731 0453 31.255
BDT P 20190731 0453 30.900
BDT S 20190731 0453 31.551
LAB P 20190731 0453 31.046
LAB S 20190731 0453 31.630
KLS P 20190731 0453 31.129
GRA P 20190731 0453 31.263
GRA S 20190731 0453 31.816

With any sort:

$ awk '!($1 in a){a[$1]=NR} {print a[$1], NR, $0}' file | sort -k1,1n -k2,2n | cut -d' ' -f3-
MEL P 20190731 0453 30.599
MEL S 20190731 0453 31.222
PUS P 20190731 0453 30.612
MEA P 20190731 0453 30.620
MEA S 20190731 0453 31.258
KDT P 20190731 0453 30.639
KDT S 20190731 0453 31.249
PAS P 20190731 0453 30.644
PAS S 20190731 0453 31.255
BDT P 20190731 0453 30.900
BDT S 20190731 0453 31.551
LAB P 20190731 0453 31.046
LAB S 20190731 0453 31.630
KLS P 20190731 0453 31.129
GRA P 20190731 0453 31.263
GRA S 20190731 0453 31.816
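
Both pipelines follow a decorate, sort, undecorate pattern: awk prefixes each line with the line number at which its first field first appeared, sort orders by that prefix, and cut strips it again. To see the intermediate form, the awk stage of the second pipeline can be run on its own. The wrapper name `decorate` and the shortened three-line input are made up for illustration.

```shell
# Decoration stage only: prints "<first-seen line> <current line> <record>"
decorate() {
    awk '!($1 in a){a[$1]=NR} {print a[$1], NR, $0}' "$@"
}

printf '%s\n' 'MEL P 30.599' 'PUS P 30.612' 'MEL S 31.222' | decorate
# 1 1 MEL P 30.599
# 2 2 PUS P 30.612
# 1 3 MEL S 31.222
```

Sorting numerically on the first two columns then brings the two MEL records together while keeping their original relative order.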
Ed Morton
0

I believe you are looking for a "stable sort" [0]. Something like:

 sort -s -k5,5n -k1,1 file1 > output

(or maybe the -k keys the other way around)

[0] https://en.wikipedia.org/wiki/Sorting_algorithm#Stability

From the man page:

       -s, --stable
              stabilize sort by disabling last-resort comparison
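
As a small illustration of what `-s` changes, the sketch below sorts three made-up records on a numeric key where two of them tie. It assumes a sort that supports `-s`, such as GNU sort, and uses `LC_ALL=C` so the fallback comparison is plain byte order.

```shell
# Three records; the first two tie on the numeric key in field 2.
data=$(printf '%s\n' 'b 1' 'a 1' 'c 0')

# Stable: tied lines keep their input order (b before a).
printf '%s\n' "$data" | LC_ALL=C sort -s -k2,2n
# c 0
# b 1
# a 1

# Default: ties fall back to a whole-line comparison (a before b).
printf '%s\n' "$data" | LC_ALL=C sort -k2,2n
# c 0
# a 1
# b 1
```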

tomc
0

Beginner's answer:

`cat file1 | sort -s -t' '` makes more sense to me (it is much simpler) than what I'm about to offer, but if you insist on the unusual ordering in your desired output, below is a bash script that does what you want.

The strategy is to label each line with a counter based on its first field: the line number at which that first-field value first appears. If the first field duplicates that of an earlier line, the line gets the counter of that earlier line:

1 MEL P 20190731 0453 30.599
2 PUS P 20190731 0453 30.612
3 MEA P 20190731 0453 30.620
4 KDT P 20190731 0453 30.639
5 PAS P 20190731 0453 30.644
6 BDT P 20190731 0453 30.900
7 LAB P 20190731 0453 31.046
8 KLS P 20190731 0453 31.129
1 MEL S 20190731 0453 31.222
4 KDT S 20190731 0453 31.249
5 PAS S 20190731 0453 31.255
3 MEA S 20190731 0453 31.258
13 GRA P 20190731 0453 31.263
6 BDT S 20190731 0453 31.551
7 LAB S 20190731 0453 31.630
13 GRA S 20190731 0453 31.816

You can see that "MEL" appears in lines 1 and 9. Because "MEL" first appears in line 1, the counter value "1" is applied to both lines 1 and 9. Likewise, since "KDT" appears in lines 4 and 10, both share the same counter value (in this case, 4). This counter is determined by a kludgy and inefficient use of cat, grep, cut, and head.
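
The lookup for a single key can be sketched as follows. The helper name `first_line_of` and the shortened sample file are made up for illustration, and the pattern is anchored to the start of the line (an assumption that keys are followed by a space) so that a key cannot match text in other columns.

```shell
# Hypothetical helper: line number of the first line in file $2
# whose first field is $1.
first_line_of() {
    grep -n "^$1 " "$2" | cut -d: -f1 | head -n1
}

sample=$(mktemp)
printf '%s\n' 'MEL P 30.599' 'PUS P 30.612' 'MEL S 31.222' > "$sample"
first_line_of MEL "$sample"   # prints 1: MEL first appears on line 1
first_line_of PUS "$sample"   # prints 2
rm -f "$sample"
```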

Then, sort according to the incrementing counter. The result is:

1 MEL P 20190731 0453 30.599
1 MEL S 20190731 0453 31.222
2 PUS P 20190731 0453 30.612
3 MEA P 20190731 0453 30.620
3 MEA S 20190731 0453 31.258
4 KDT P 20190731 0453 30.639
4 KDT S 20190731 0453 31.249
5 PAS P 20190731 0453 30.644
5 PAS S 20190731 0453 31.255
6 BDT P 20190731 0453 30.900
6 BDT S 20190731 0453 31.551
7 LAB P 20190731 0453 31.046
7 LAB S 20190731 0453 31.630
8 KLS P 20190731 0453 31.129
13 GRA P 20190731 0453 31.263
13 GRA S 20190731 0453 31.816

Cut out the counter and you have your desired output.

Here's the script. Run it as `/bin/bash stablenosort.sh file1`:

#!/bin/bash

# Description: Groups lines by their first space-delimited field while
#   keeping the groups in order of first appearance.
# Usage: stablenosort.sh [file]
# Ref/attrib:
#  [1]: Trim blank lines: https://stackoverflow.com/a/29549497/10850071

FILEIN="$1"

if [ -f "$FILEIN" ]; then
    INPUT="$(cat "$FILEIN")"; # not named LINES, which bash may reset itself
else
    exit 1;
fi

OUTPUT=""
while IFS= read -r line; do
    # Label each line with the line number where its field 1 first appears
    FIELD1="$(printf '%s\n' "$line" | awk '{print $1}')" # get field 1
    INCR_LABEL="$(grep -n "^$FIELD1 " "$FILEIN" | cut -d':' -f1 | head -n1)" # first line starting with FIELD1
    OUTPUT="$OUTPUT$INCR_LABEL $line"$'\n' # Prepend label to fields
done <<< "$INPUT"

# Stable numeric sort on the label field only, then cut the label
OUTPUT=$(printf '%s' "$OUTPUT" | sort -s -t' ' -k1,1n | cut -d' ' -f2-)
OUTPUT=$(printf '%s\n' "$OUTPUT" | awk 'NF') # Trim blank lines. See [1].
printf '%s\n' "$OUTPUT" # print final OUTPUT.
baltakatei
  • Run your script through http://shellcheck.net and read [why-is-using-a-shell-loop-to-process-text-considered-bad-practice](https://unix.stackexchange.com/questions/169716/why-is-using-a-shell-loop-to-process-text-considered-bad-practice) and [correct-bash-and-shell-script-variable-capitalization](https://stackoverflow.com/questions/673055/correct-bash-and-shell-script-variable-capitalization) – Ed Morton May 17 '20 at 12:48