1

I can't seem to find any way of sorting a word based on its characters in awk. For example if the word is "hello" then its sorted equivalent is "ehllo". how to achieve this in awk ?

Abhi
  • 116
  • 7
  • 5
    This sounds like an XY problem See: [What is the XY problem?](https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem). Awk is not strong on sorting and even less so within a string. Why have you chosen awk? If that isn't a hard requirement - you are better served using another tool. – David C. Rankin Sep 14 '21 at 05:21
  • I have started learning shell scripting a few days ago, can you recommend the tool to sort a string ? I have to check if a particular string is a permutation of another string. I thought of sorting both and then comparing letter by letter, but could not find anything to sort the string. – Abhi Sep 14 '21 at 05:23
  • 2
    For non-awk based solutions: https://stackoverflow.com/questions/2373874/how-to-sort-characters-in-a-string – Sundeep Sep 14 '21 at 06:23
  • This site exists to help people with their code. [edit] your question to include a [mcve] with concise, testable sample input and expected output plus your attempt to solve the problem yourself and then we can help you. – Ed Morton Sep 14 '21 at 16:30

5 Answers5

4

With GNU awk for PROCINFO[], "sorted_in" (see https://www.gnu.org/software/gawk/manual/gawk.html#Controlling-Scanning) and splitting with a null separator resulting in an array of chars:

$ echo 'hello' |
awk '
    BEGIN { PROCINFO["sorted_in"]="@val_str_asc" }
    {
        split($1,chars,"")
        word = ""
        for (i in chars) {
            word = word chars[i]
        }
        print word
    }
'
ehllo

$ echo 'hello' | awk -v ordr='@val_str_asc' 'BEGIN{PROCINFO["sorted_in"]=ordr} {split($1,chars,""); word=""; for (i in chars) word=word chars[i]; print word}'
ehllo

$ echo 'hello' | awk -v ordr='@val_str_desc' 'BEGIN{PROCINFO["sorted_in"]=ordr} {split($1,chars,""); word=""; for (i in chars) word=word chars[i]; print word}'
ollhe
Ed Morton
  • 188,023
  • 17
  • 78
  • 185
2

Another option is a Decorate-Sort-Undecorate with sed. Essentially, you use sed to break "hello" into one character per-line (decorating each character with a newline '\n') and pipe the result to sort. You then use sed to do the reverse (undecorate each line by removing the '\n') to join the lines back together.

printf "hello" | sed 's/\(.\)/\1\n/g' | sort | sed '{:a N;s/\n//;ta}'
ehllo

There are several approaches you can use, but this one is shell friendly, but the behavior requires GNU sed.

David C. Rankin
  • 81,885
  • 6
  • 58
  • 85
1

This would be more doable with gawk, which includes the asort function to sort an array:

awk 'BEGIN{FS=OFS=ORS=""}{split($0,a);asort(a);for(i in a)print a[i]}'<<<hello

This outputs:

ehllo

Demo: https://ideone.com/ylWQLJ

blhsing
  • 91,368
  • 6
  • 71
  • 106
  • This is wrong, idk why it's the accepted answer. I guess maybe it produced the expected output for some sample input and the OP thought that it'd always do that but it won't - if that happened it was just luck. – Ed Morton Sep 14 '21 at 16:28
  • @EdMorton Hi, thanks for your insight. Could you please tell why it would fail ? – Abhi Sep 15 '21 at 04:12
  • @Abhi `for(i in a)` by default visits the array indices in implementation-dependent (often hash) order (see https://www.gnu.org/software/gawk/manual/gawk.html#Scanning-an-Array) so you can have input like `bca` then run `asort()` and it becomes `abc` then run `for (i in a)` and get output of `cab` or `bca` or `acb` or anything else. Even if you get the output you want now, you can get output you don't want for other input or for a different version of the tool or when run on a different platform or..... – Ed Morton Sep 15 '21 at 12:03
  • @EdMorton I get your point now. Thanks for the clarification. I read the link that you had provided and for `gawk`, I found that it can forced to iterate in numeric order by setting `PROCINFO["sorted_in"]`. Though I don't exactly know how to do it here. – Abhi Sep 15 '21 at 14:19
  • @Abhi I already provided an answer that does exactly that, see https://stackoverflow.com/a/69192721/1745001. – Ed Morton Sep 15 '21 at 14:20
1

You need to write a function to sort letters in a word (see : https://www.gnu.org/software/gawk/manual/html_node/Join-Function.html):

function siw(word,        result, arr, arrlen, arridx) {
    split(word, arr, "")
    arrlen = asort(arr)
    for (arridx = 1; arridx <= arrlen; arridx++) {
        result = result arr[arridx]
    }
    return result
}

And define a sort sub-function to compare two words (see : https://www.gnu.org/software/gawk/manual/html_node/Array-Sorting-Functions.html):

function compare_by_letters(i1, v1, i2, v2,        left, right) {
    left  = siw(v1)
    right = siw(v2)
    if (left < right)
        return -1
    else if (left == right)
        return 0
    else
        return 1
}

And use this function with awk sort function:

asort(array_test, array_test_result, "compare_by_letters")

Then, the sample program is:

function siw(word,        result, arr, arrlen, arridx) {
    result = hash_word[word]
    if (result != "") {
        return result
    }
    split(word, arr, "")
    arrlen = asort(arr)
    for (arridx = 1; arridx <= arrlen; arridx++) {
        result = result arr[arridx]
    }
    hash_word[word] = result
    return result
}

function compare_by_letters(i1, v1, i2, v2,        left, right) {
    left  = siw(v1)
    right = siw(v2)
    if (left < right)
        return -1
    else if (left == right)
        return 0
    else
        return 1
}

{
    array_test[i++] = $0
}

END {
    alen = asort(array_test, array_test_result, "compare_by_letters")
    for (aind = 1; aind <= alen; aind++) {
        print array_test_result[aind]
    }
}

Executed like this:

echo -e "fail\nhello\nborn" | awk -f sort_letter.awk

Output:

fail
born
hello

Of course, if you have a big input, you could adapt siw function to memorize result for fastest compute:

function siw(word,        result, arr, arrlen, arridx) {
    result = hash_word[word]
    if (result != "") {
        return result
    }
    split(word, arr, "")
    arrlen = asort(arr)
    for (arridx = 1; arridx <= arrlen; arridx++) {
        result = result arr[arridx]
    }
    hash_word[word] = result
    return result
}
Arnaud Valmary
  • 2,039
  • 9
  • 14
0

here's a very unorthodox method for a quick-n-dirty approach, if you really want to sort "hello" into "ehllo" :

mawk/mawk2/gawk 'BEGIN { FS="^$" 

        # to make it AaBbCc… etc;  chr(65) = ascii "A"

        for (x = 65; x < 91; x++) {

            ref = sprintf("%s%c%c",ref, x, x+32) 

   } } /^[[:alpha:]]$/ { print } /[[:alpha:]][[:alpha:]]+/ { 
   
             # for gawk/nawk, feel free to change 
             # that to /[[:alpha:]]{2,}/
             # the >= 2+ condition is to prevent wasting time
             # sorting single letter words "A" and "I"

      s=""; x=1; len=length(inp=$0);

      while ( len && (x<53) ) {  
         if (inp~(ch = substr(ref,x++,1))) {
            while ( sub(ch,"",inp) ) {
                   s  = s ch; 
                 len -= 1   ;
      } } }

    print s }'

I'm aware it's an extremely inefficient way of doing selection sort. The potential time-savings stem from instant loop ending the moment all letters are completed, instead of iterating all 52 letters everytime. The downside is that it doesn't pre-profile the input

(e.g. if u detect that this row is only lower-case, then u can speed it up with a lowercase only loop instead)

The upside is that it eliminates the need for custom-functions, eliminate any gawk dependencies, and also eliminate the need to split every row into an array (or every character into its own field)

i mean yes technically one can set FS to null string thus automatically becomes having NF as the string length. But at times it could be slow if input is a bit large. If you need unicode support, then a match()-based approach is more desirable.

  • added (x<53) condition to prevent run-away infinite loops in case input isn't pure ASCII letters
RARE Kpop Manifesto
  • 2,453
  • 3
  • 11