How to sort a file based on key name instead of its position in unix?

Question

I want to sort a file in Unix, and for that I am using the command

sort file --field-separator=' ' --key=7,7

But the position of this field is not fixed: sometimes it is the 7th field, sometimes the 6th or 8th field in the line.

Is it possible to sort the file based on a field name instead, something like

sort file --field-separator=' ' --keyname=<my_unique_id>

The file looks something like this; I want to sort on party_id:

status_date="2000-01-31" ref_date="2021-03-31" ead_percent="0.00365316" accounting_standard="IFRS" party_default_status_cd="NOTDFLT" party_id="36113477" v_src_system_id="ABC"
status_date="2002-12-31" ref_date="2021-03-31" ead_percent="1" accounting_standard="IFRS" orig_src_system_id="GRD" party_default_status_cd="UNLIKE" party_id="36053415" v_src_system_id="XYZ"

Columns consist out of index, not keys, and you didn't show what your input file looks like. — mpapec, Jan 12 '22 at 11:19

score 2 · Accepted Answer · answered Jan 12 '22 at 11:52

sort doesn't have a concept of named keys, but you can perform a Schwartzian transform to temporarily add the key as a prefix to the line, sort on the first field, then discard it.

sed 's/\(.*\)\(party_id="[^"]*"\)/\2    \1\2/' file |
sort -t '   ' -k1,1 |
cut -f2-

(The whitespace between the first two back-references in the sed replacement, and the argument to sort -t, is a literal tab, which Stack Overflow renders as a sequence of spaces.)
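Because the literal tabs are easy to lose when copying, here is a minimal sketch of the same decorate / sort / undecorate pipeline with the tab held in a shell variable and each stage commented. The function name sort_by_party_id and the sample lines are assumptions made for illustration; the sed, sort and cut stages are the ones from the answer above.

```shell
#!/bin/sh
# Sketch only: same pipeline as above, reading from standard input.
tab=$(printf '\t')   # a literal tab character, kept in a variable

sort_by_party_id() {   # hypothetical helper name
    # 1. decorate: copy the party_id="..." pair to the front of the line,
    #    followed by a tab
    sed "s/\(.*\)\(party_id=\"[^\"]*\"\)/\2${tab}\1\2/" |
    # 2. sort on the first tab-separated field only (string comparison)
    sort -t "$tab" -k1,1 |
    # 3. undecorate: drop the first tab-separated field again
    cut -f2-
}

# Made-up sample input; party_id sits at a different position in each line.
sort_by_party_id <<'EOF'
a="1" party_id="36113477" v_src_system_id="ABC"
a="2" extra="x" party_id="36053415" v_src_system_id="XYZ"
EOF
```

To sort a real file, redirect it into the function, e.g. sort_by_party_id < file. The comparison is on the text party_id="...", so ids of different lengths would sort as strings, not as numbers.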

tripleee


You don't have to use a tab at all; some things would be easier with a common punctuation character like comma or colon, for example, but then there is more risk that the separator clashes with something else in your data. – tripleee Jan 12 '22 at 11:54
Thanks tripleee. It works. Can you please explain what lines 1 and 3 are doing? Also, if I need to add other keys for a multilevel sort, what changes do I need to make to your command? – Neha Jan 12 '22 at 12:37
A "Schwartzian transform" is a specific Perl implementation of the common decorate/sort/undecorate idiom that existed for years before Randal showed an example of using it in Perl in response to a Usenet question. – Ed Morton Jan 12 '22 at 12:41
The Wikipedia link explains the idea in quite some detail. – tripleee Jan 12 '22 at 12:46
@EdMorton True, in its strictest sense the term describes the Perl idiom, but I have seen it applied more broadly and the term seems to have caught on more generally. See e.g. https://stackoverflow.com/questions/29548505/equivalent-of-pythons-list-sort-with-key-schwartzian-transform – tripleee Jan 12 '22 at 12:47
Right, I'm just saying what you have isn't that, it's a plain old Decorate/Sort/Undecorate as it's not the perl implementation of that idiom. Fingers crossed that term doesn't REALLY catch on in other applications or it'd be exactly the same as if Joe Smith coded a Binary Sort in perl or any other tool/language and suddenly Binary Sort became known as a Smiths Transform! – Ed Morton Jan 12 '22 at 13:12
Too late, it's already in Haskell https://hackage.haskell.org/package/base-4.16.0.0/docs/src/Data.OldList.html#sortOn – karakfa Jan 12 '22 at 15:42
Ugh. Well, in another 10 years all words will have been replaced with emojis anyway and then it won't matter anymore :-). – Ed Morton Jan 12 '22 at 17:14

Ed Morton · Answer 2 · 2022-01-12T13:46:35.757

Using the Decorate/Sort/Undecorate idiom and assuming that, like in the example you provided, your quoted strings don't contain blanks, =, or ":

Decorate:

$ awk -F'[ ="]+' -v OFS='\t' -v keyname='party_id' '{for (i=1; i<NF; i+=2) if ($i == keyname) { print $(i+1), $0; next} }' file
36113477        status_date="2000-01-31" ref_date="2021-03-31" ead_percent="0.00365316" accounting_standard="IFRS" party_default_status_cd="NOTDFLT" party_id="36113477" v_src_system_id="ABC"
36053415        status_date="2002-12-31" ref_date="2021-03-31" ead_percent="1" accounting_standard="IFRS" orig_src_system_id="GRD" party_default_status_cd="UNLIKE" party_id="36053415" v_src_system_id="XYZ"

Decorate then Sort:

$ awk -F'[ ="]+' -v OFS='\t' -v keyname='party_id' '{for (i=1; i<NF; i+=2) if ($i == keyname) { print $(i+1), $0; next} }' file |
    sort -k1,1n
36053415        status_date="2002-12-31" ref_date="2021-03-31" ead_percent="1" accounting_standard="IFRS" orig_src_system_id="GRD" party_default_status_cd="UNLIKE" party_id="36053415" v_src_system_id="XYZ"
36113477        status_date="2000-01-31" ref_date="2021-03-31" ead_percent="0.00365316" accounting_standard="IFRS" party_default_status_cd="NOTDFLT" party_id="36113477" v_src_system_id="ABC"

Decorate then Sort then Undecorate:

$ awk -F'[ ="]+' -v OFS='\t' -v keyname='party_id' '{for (i=1; i<NF; i+=2) if ($i == keyname) { print $(i+1), $0; next} }' file |
    sort -k1,1n |
    cut -f2-
status_date="2002-12-31" ref_date="2021-03-31" ead_percent="1" accounting_standard="IFRS" orig_src_system_id="GRD" party_default_status_cd="UNLIKE" party_id="36053415" v_src_system_id="XYZ"
status_date="2000-01-31" ref_date="2021-03-31" ead_percent="0.00365316" accounting_standard="IFRS" party_default_status_cd="NOTDFLT" party_id="36113477" v_src_system_id="ABC"
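For a multilevel sort, the same idiom extends by decorating with one column per key. The following is a minimal sketch under stated assumptions: the function name sort_by_two_keys is hypothetical, the first key is compared as a string and the second as a number, and the sample lines are made up. Unlike the commands above, it prints lines that lack a key (with an empty decoration, so they sort first) instead of dropping them.

```shell
#!/bin/sh
# Sketch only: decorate with two key values, sort on both, undecorate.
tab=$(printf '\t')   # a literal tab character for sort -t

sort_by_two_keys() {   # hypothetical helper: sort_by_two_keys KEY1 KEY2 < file
    awk -F'[ ="]+' -v OFS='\t' -v key1="$1" -v key2="$2" '
    {
        v1 = v2 = ""                      # empty when a key is missing
        for (i = 1; i < NF; i += 2) {     # odd fields are names, even are values
            if ($i == key1) v1 = $(i+1)
            if ($i == key2) v2 = $(i+1)
        }
        print v1, v2, $0                  # two decoration columns, then the line
    }' |
    sort -t "$tab" -k1,1 -k2,2n |         # key1 as a string, then key2 numerically
    cut -f3-                              # strip both decoration columns
}

# Made-up sample input: sort by ref_date, then by party_id.
sort_by_two_keys ref_date party_id <<'EOF'
ref_date="2021-03-31" party_id="200" x="a"
x="b" ref_date="2020-12-31" party_id="300"
party_id="100" ref_date="2021-03-31" x="c"
EOF
```

As with the single-key version, this assumes the quoted values contain no blanks, = or " characters.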

Maybe using `match` to sieve out the decoration is simpler than iterating over fields. — FelixJN, Jan 12 '22 at 17:09
@FelixJN it's a bit harder to use match() robustly and extensibly because then you're dealing with partial regexp matching by default instead of full-word string matching so you have to consider regexp metachars and substrings and if you wanted to extend the logic to handle multiple targets then the obvious choice of adding into the regexp fails if/when the targets appear in a different order. It's not a big deal but IMHO the loop with full-word string comparisons is just clearer and simpler. — Ed Morton, Jan 12 '22 at 17:12

score 0 · Answer 3 · answered Jan 12 '22 at 17:07

With GNU awk (gawk) you can specify the order in which arrays are traversed. The following stores each line in an array, indexed by its party_id=... string, and then prints the array sorted by those indices. The whole file is held in memory, so this is limited by RAM for very large files, and lines that share a party_id overwrite one another.

awk '{match($0, /party_id=[^ ]*/, id); arr[id[0]] = $0}
     END {PROCINFO["sorted_in"]="@ind_str_asc"
          for (i in arr) {print arr[i]}
     }' infile.txt
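If several lines can share a party_id, a variant that appends to the array entry instead of overwriting it keeps all of them. This is a minimal sketch, assuming gawk is installed; the function name sort_by_party_id_gawk is hypothetical. The three-argument match() and PROCINFO["sorted_in"] are gawk extensions, so this will not run under other awks.

```shell
#!/bin/sh
# Sketch only: requires GNU awk (gawk).
sort_by_party_id_gawk() {   # hypothetical helper, reads standard input
    gawk '
    match($0, /party_id="[^"]*"/, id) {
        key = id[0]
        # Append rather than overwrite, so duplicate ids keep all their lines.
        arr[key] = ((key in arr) ? arr[key] ORS : "") $0
    }
    END {
        PROCINFO["sorted_in"] = "@ind_str_asc"   # traverse indices as ascending strings
        for (k in arr) print arr[k]
    }'
}
```

Usage would be sort_by_party_id_gawk < infile.txt. Lines without a party_id are skipped, and lines with the same id come out in their original order.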
