Rearranging lines starting with the same string

Question

I'm trying to rearrange file1 which has been sorted by the last column as below

MEL P 20190731 0453 30.599
PUS P 20190731 0453 30.612
MEA P 20190731 0453 30.620
KDT P 20190731 0453 30.639
PAS P 20190731 0453 30.644
BDT P 20190731 0453 30.900
LAB P 20190731 0453 31.046
KLS P 20190731 0453 31.129
MEL S 20190731 0453 31.222
KDT S 20190731 0453 31.249
PAS S 20190731 0453 31.255
MEA S 20190731 0453 31.258
GRA P 20190731 0453 31.263
BDT S 20190731 0453 31.551
LAB S 20190731 0453 31.630
GRA S 20190731 0453 31.816

into the output below, where lines that have the same string in the first column are grouped next to each other, such as

MEL P 20190731 0453 30.599
MEL S 20190731 0453 31.222
PUS P 20190731 0453 30.612
MEA P 20190731 0453 30.620
MEA S 20190731 0453 31.258
KDT P 20190731 0453 30.639
KDT S 20190731 0453 31.249
PAS P 20190731 0453 30.644
PAS S 20190731 0453 31.255
BDT P 20190731 0453 30.900
BDT S 20190731 0453 31.551
LAB P 20190731 0453 31.046
LAB S 20190731 0453 31.630
KLS P 20190731 0453 31.129
GRA P 20190731 0453 31.263
GRA S 20190731 0453 31.816

while still respecting the order of the last column (notice that, for instance, the MEL lines are now next to each other and that the position of PUS relative to the others is unchanged).

I have tried this code to produce a key

awk '!array[$1]++ {print $1}' file1 > key

which I then tried to match against file1 in order to reorder the lines, using

grep -Fwf key file > output

but nothing changes. Please help!

I don't understand this criterion, "where each line containing the same character in the first column are grouped next to each other along lines"; wouldn't that imply that all lines starting with `M` should be grouped together? — Benjamin W., May 16 '20 at 19:35
you've stated '*same character in the first column are grouped next to each other*' but this isn't what the desired output is showing, eg, for the first letter 'M' ... why aren't the 'MEL' and 'MEA' rows (all start with letter 'M') '*grouped next to each other*'? same question about the 'PUS' and 'PAS' rows — markp-fuso, May 16 '20 at 19:35
what I meant was I want that MEL is next to MEL, MEA to MEA, etc. but then the order as sorted in file1 is still respected (where MEL goes first, then PUS, then MEA, then KDT, and so on). Is that clearer? — dex10, May 16 '20 at 19:39
@JamesBrown nope, it rearranges everything. Now MEA is on top, because I believe it's primarily sorted by first column. — dex10, May 16 '20 at 20:08
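
For reference, the grep attempt in the question cannot reorder anything: grep prints matching lines in the order they occur in its input, regardless of the order of the patterns in the pattern file. A minimal sketch on toy data (the variables `data` and `keys` and the function `grep_attempt` are made up for illustration and only stand in for file1 and key):

```shell
# Toy data standing in for file1 and key (illustrative only).
data='MEL P 1
PUS P 2
MEL S 3'
keys='PUS
MEL'

# grep emits matches in input order, not pattern order, so when every
# line matches some key the output is identical to the input.
grep_attempt() {
    keyfile=$(mktemp)
    printf '%s\n' "$keys" > "$keyfile"
    printf '%s\n' "$data" | grep -Fwf "$keyfile"
    rm -f "$keyfile"
}

grep_attempt
```

The three lines come out in their original order even though PUS is listed first in the key list, which matches the "nothing changes" observation.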

James Brown · Accepted Answer · 2020-05-16T20:16:13.773

2

In awk:

$ awk '{
    if(!($1 in a))           # enumerate all unique $1 for looping in END
        n[++c]=$1
    a[$1]=a[$1] $0 ORS       # append records to hash keyed on $1
}
END {                        # after processing records
    for(i=1;i<=c;i++)        # loop 
        printf "%s",a[n[i]]  # and output
}' file

Output:

MEL P 20190731 0453 30.599
MEL S 20190731 0453 31.222
PUS P 20190731 0453 30.612
MEA P 20190731 0453 30.620
MEA S 20190731 0453 31.258
KDT P 20190731 0453 30.639
KDT S 20190731 0453 31.249
PAS P 20190731 0453 30.644
PAS S 20190731 0453 31.255
BDT P 20190731 0453 30.900
BDT S 20190731 0453 31.551
LAB P 20190731 0453 31.046
LAB S 20190731 0453 31.630
KLS P 20190731 0453 31.129
GRA P 20190731 0453 31.263
GRA S 20190731 0453 31.816

It expects the data to be sorted on the last field.

edited May 16 '20 at 20:16

answered May 16 '20 at 20:06

James Brown

YES this is it! I get the idea from your explanation, thank you, but what does the line `a[$1]=a[$1] $0 ORS` mean exactly? Sorry, I don't understand awk very well. – dex10 May 16 '20 at 20:24
It catenates the current record (line) `$0` and a newline (`ORS`) to the end of the related hash variable `a[$1]`. – James Brown May 16 '20 at 20:27
@EdMorton nice observation, thank you for the recommendation! – dex10 May 16 '20 at 20:48
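
The concatenation described in the comments above can be watched on a tiny input. This is only a sketch: the function name `group_for_key` and the sample lines are invented for illustration, and it prints the buffer for a single key rather than looping over all of them.

```shell
# Collect lines per first field; a[$1] grows by one line each time the key recurs.
group_for_key() {
    awk -v key="$1" '
        { a[$1] = a[$1] $0 ORS }     # old value, then the line, then a newline
        END { printf "%s", a[key] }  # print only the buffer for the requested key
    '
}

printf 'x 1\ny 2\nx 3\n' | group_for_key x
```

For key x the buffer ends up holding "x 1" and "x 3" on separate lines, in the order they were read, which is why the grouped output keeps the original relative order within each key.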

Ed Morton · Answer 2 · 2020-05-16T20:23:39.280

With GNU sort for -s:

$ awk '!($1 in a){a[$1]=NR} {print a[$1], $0}' file | sort -s -k1,1n | cut -d' ' -f2-
MEL P 20190731 0453 30.599
MEL S 20190731 0453 31.222
PUS P 20190731 0453 30.612
MEA P 20190731 0453 30.620
MEA S 20190731 0453 31.258
KDT P 20190731 0453 30.639
KDT S 20190731 0453 31.249
PAS P 20190731 0453 30.644
PAS S 20190731 0453 31.255
BDT P 20190731 0453 30.900
BDT S 20190731 0453 31.551
LAB P 20190731 0453 31.046
LAB S 20190731 0453 31.630
KLS P 20190731 0453 31.129
GRA P 20190731 0453 31.263
GRA S 20190731 0453 31.816

With any sort:

$ awk '!($1 in a){a[$1]=NR} {print a[$1], NR, $0}' file | sort -k1,1n -k2,2n | cut -d' ' -f3-
MEL P 20190731 0453 30.599
MEL S 20190731 0453 31.222
PUS P 20190731 0453 30.612
MEA P 20190731 0453 30.620
MEA S 20190731 0453 31.258
KDT P 20190731 0453 30.639
KDT S 20190731 0453 31.249
PAS P 20190731 0453 30.644
PAS S 20190731 0453 31.255
BDT P 20190731 0453 30.900
BDT S 20190731 0453 31.551
LAB P 20190731 0453 31.046
LAB S 20190731 0453 31.630
KLS P 20190731 0453 31.129
GRA P 20190731 0453 31.263
GRA S 20190731 0453 31.816
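
To see why the second pipeline works with any sort, it can help to look at the decorated lines before they are sorted. The sketch below wraps the same awk command in a function called `decorate` (a name chosen here, not part of the answer) and runs it on a shortened, made-up input.

```shell
# Decorate step only: prefix each line with the line number at which its
# first field was first seen, followed by its own line number.
decorate() {
    awk '!($1 in a){a[$1]=NR} {print a[$1], NR, $0}'
}

printf 'MEL P\nPUS P\nMEL S\n' | decorate

# Sorting on the two numeric prefixes and cutting them off regroups the lines.
printf 'MEL P\nPUS P\nMEL S\n' | decorate | sort -k1,1n -k2,2n | cut -d' ' -f3-
```

The first command prints "1 1 MEL P", "2 2 PUS P" and "1 3 MEL S"; the second prints "MEL P", "MEL S" and "PUS P". The second prefix plays the role that -s plays in the GNU variant: it breaks ties by original position.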

tomc · Answer 3 · 2020-05-16T20:02:41.897

0

I believe you are looking for a "stable sort" [0]. Something like:

 sort -s -k5,5n -k1,1 file1 > output

(or maybe the -k keys the other way around)

[0] https://en.wikipedia.org/wiki/Sorting_algorithm#Stability

From the man page:

       -s, --stable
              stabilize sort by disabling last-resort comparison
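
The effect of the -s flag quoted above can be checked on a three-line toy input. This is a sketch with invented data (the helper `sample` is not from the answer), and it assumes a sort that supports -s, such as GNU sort.

```shell
# Two lines share the key 1 in field 2; b comes before a in the input.
sample() { printf 'b 1\na 1\nc 0\n'; }

# With -s, lines that tie on the key keep their input order (b before a).
sample | sort -s -k2,2n

# Without -s, sort falls back to comparing whole lines, so a moves ahead of b.
sample | sort -k2,2n
```

The stable run prints "c 0", "b 1", "a 1"; the default run prints "c 0", "a 1", "b 1".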

edited May 16 '20 at 20:02

answered May 16 '20 at 19:53

tomc

score 0 · Answer 4 · answered May 16 '20 at 21:45

Beginner's answer:

`cat file1 | sort -s -t' '` makes more sense to me (it is so much simpler) than what I'm about to offer, but if you insist on the unusual ordering in your desired output, below is a bash script that does what you want.

The strategy is to assign an incrementing counter to each line based on what is in the first field. If the first field contains an entry that is a duplicate of an earlier line, assign the counter for the previously encountered duplicate:

1 MEL P 20190731 0453 30.599
2 PUS P 20190731 0453 30.612
3 MEA P 20190731 0453 30.620
4 KDT P 20190731 0453 30.639
5 PAS P 20190731 0453 30.644
6 BDT P 20190731 0453 30.900
7 LAB P 20190731 0453 31.046
8 KLS P 20190731 0453 31.129
1 MEL S 20190731 0453 31.222
4 KDT S 20190731 0453 31.249
5 PAS S 20190731 0453 31.255
3 MEA S 20190731 0453 31.258
13 GRA P 20190731 0453 31.263
6 BDT S 20190731 0453 31.551
7 LAB S 20190731 0453 31.630
13 GRA S 20190731 0453 31.816

You can see that "MEL" appears in lines 1 and 9. Because "MEL" appears first in line 1, the counter value "1" is applied to both lines 1 and 9. Likewise, since "KDT" appears in both lines 4 and 10, they share the same counter value (in this case, 4). In effect, the counter is the line number of the first occurrence of the first field, which is why "GRA" gets 13. It is determined by a kludgy and inefficient use of cat, grep, cut, and head.
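
The lookup of a single counter value can be sketched in isolation. The helper names `sample` and `first_line_of` and the shortened data are made up for illustration; anchoring the pattern to the start of the line is a choice made in this sketch.

```shell
# Toy stand-in for file1 (contents are illustrative only).
sample() { printf 'MEL P\nPUS P\nKDT P\nMEL S\nKDT S\n'; }

# Line number of the first line whose first field is the given key:
# grep -n prefixes each match with "N:", cut keeps N, head keeps the first.
first_line_of() {
    sample | grep -n "^$1 " | cut -d':' -f1 | head -n1
}

first_line_of MEL
first_line_of KDT
```

This prints 1 for MEL and 3 for KDT. Repeating the lookup for every input line means rereading the whole file each time, which is where the inefficiency comes from.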

Then, sort according to the incrementing counter. The result is:

1 MEL P 20190731 0453 30.599
1 MEL S 20190731 0453 31.222
2 PUS P 20190731 0453 30.612
3 MEA P 20190731 0453 30.620
3 MEA S 20190731 0453 31.258
4 KDT P 20190731 0453 30.639
4 KDT S 20190731 0453 31.249
5 PAS P 20190731 0453 30.644
5 PAS S 20190731 0453 31.255
6 BDT P 20190731 0453 30.900
6 BDT S 20190731 0453 31.551
7 LAB P 20190731 0453 31.046
7 LAB S 20190731 0453 31.630
8 KLS P 20190731 0453 31.129
13 GRA P 20190731 0453 31.263
13 GRA S 20190731 0453 31.816

Cut out the counter and you have your desired output.

Here's the script. Run as $ /bin/bash stablenosort.sh file1

#!/bin/bash

# Description: Stable sorts (?) by first space-delimited field without
#   sorting by that field.
# Usage: stablenosort.sh [file]
# Ref/attrib:
#  [1]: Trim blank lines: https://stackoverflow.com/a/29549497/10850071

FILEIN="$1"

if [ -f "$FILEIN" ]; then
    LINES="$(cat "$FILEIN")";
else
    exit 1;
fi


while IFS= read -r line; do
    # Generate incrementing label from field1
    FIELD1="$(printf '%s\n' "$line" | awk '{print $1}')" # get field 1
    # Label = number of the first line whose first field is FIELD1
    # (FIELD1 is used as a regex, so it must not contain regex metacharacters).
    INCR_LABEL="$(cat "$FILEIN" | grep -E -n "^$FIELD1( |\$)" | cut -d':' -f1 | head -n1)"
    OUTPUT="$OUTPUT"$'\n'"$INCR_LABEL $line" # Prepend incrementing label to fields
done <<< "$LINES"

# Stable numeric sort on the label field only (-s needs GNU or BSD sort), then cut the label
OUTPUT=$(printf '%s\n' "$OUTPUT" | sort -s -t' ' -k1,1n | cut -d' ' -f2-)
OUTPUT=$(printf '%s\n' "$OUTPUT" | awk 'NF') # Trim blank lines. See [1].
printf '%s\n' "$OUTPUT" # print final OUTPUT.

Run your script through http://shellcheck.net and read [why-is-using-a-shell-loop-to-process-text-considered-bad-practice](https://unix.stackexchange.com/questions/169716/why-is-using-a-shell-loop-to-process-text-considered-bad-practice) and [correct-bash-and-shell-script-variable-capitalization](https://stackoverflow.com/questions/673055/correct-bash-and-shell-script-variable-capitalization) — Ed Morton, May 17 '20 at 12:48
