I am working on a Bash scripting project in which I need to delete one of two files if they have identical content. I should delete the one that comes last in an alphabetical sort; in the example output my professor has provided, apple.dat is deleted when the choices are apple.dat and Apple.dat.

if [[ "apple" > "Apple" ]]; then
    echo apple
else
    echo Apple
fi

prints Apple

echo $(echo -e "Apple\napple" | sort | tail -n1)

prints Apple

The ASCII value of a is 97 and that of A is 65, so why is the test saying A is greater?

The weird thing is that I get opposite results with the older syntax:

if [ "apple" \> "Apple" ]; then
    echo apple
else
    echo Apple
fi

prints apple

and if we try to use the \> in the [[ ]] syntax, it is a syntax error.

How can we correct this for the double bracket syntax? I have tested this on the school Debian server, my local machine, and my Digital Ocean droplet. On my local Ubuntu 20.04 machine and on the school server I get the output described above. Interestingly, on my droplet, which is also an Ubuntu 20.04 server, I get "apple" with both the double and the single bracket syntax. We are allowed to use either syntax, the double bracket or the single bracket test call, but I prefer the newer double bracket syntax and would rather learn how to make this work than convert my mostly finished script to the older, more POSIX-compliant syntax.

  • What locale are you using (on your computer, the school Debian server, and the DO droplet)? You can use the `locale` command to find out. – Gordon Davisson Feb 21 '21 at 02:17
  • Local machine and school server are using en_US.UTF-8 as I expected, while my droplet is using C.UTF-8. – Astral Axiom Feb 21 '21 at 02:34
  • Ah, so as I am reading here: [StackOverflow locale question](https://stackoverflow.com/questions/55673886/what-is-the-difference-between-c-utf-8-and-en-us-utf-8-locales#55693338) that makes a difference in alphabetical sorting – Astral Axiom Feb 21 '21 at 02:41
  • I just wrote a short test script in which I do "LANG=C.UTF-8" before the conditional and then "LANG=en_US.UTF-8" just after it. This works, but I am not sure it is good practice. For compatibility across machines, I suppose I could save the current LANG value into a variable and restore it from there after the conditional check, as opposed to explicitly setting it to en_US.UTF-8. – Astral Axiom Feb 21 '21 at 02:49
  • Shell and environment variables are local to a process (although env variables get inherited by subprocesses). Since scripts run as subprocesses, changing `LANG` in a script won't affect the parent shell. Unless you need to reset it for something later in the script, don't worry about resetting it. – Gordon Davisson Feb 21 '21 at 02:54
  • Thank you @GordonDavisson, your comment led to the solution to this : ) – Astral Axiom Feb 21 '21 at 03:32
  • I cannot accept your comment as an answer. If you post it as an answer I will make it the accepted answer : ) – Astral Axiom Feb 21 '21 at 03:37
  • I'd add your own answer (or accept Léa's), since I don't know the full solution you settled on (I just provided some pointers to finding it). IMO the bit about locality of variables is appropriate as a comment. – Gordon Davisson Feb 21 '21 at 04:34
  • I will accept Lea's then. Thanks for the suggestion, I was unsure of the etiquette. My own answer would only be a variation of that. – Astral Axiom Feb 21 '21 at 04:47
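
The point about variables being local to a process can be checked with a minimal sketch like the one below. It assumes `bash` is on the PATH; the variable name `lang_before` is only illustrative.

```shell
# Sketch: a child process gets its own copy of the environment.
lang_before=${LANG-}                 # remember the parent's value (may be empty)
bash -c 'LANG=C; export LANG'        # the change happens only in the child process
if [ "${LANG-}" = "$lang_before" ]; then
    echo "parent LANG unchanged"
fi
```

The same holds for a script started from an interactive shell, so saving and restoring LANG is only needed when later commands in the same script depend on the original value.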

3 Answers

Hints:

$ (LC_COLLATE=C; if [ "apple" \> "Apple" ]; then echo apple; else echo Apple; fi)
apple
$ (LC_COLLATE=en_US; if [ "apple" \> "Apple" ]; then echo apple; else echo Apple; fi)
apple

but:

$ (LC_COLLATE=C; if [[ "apple" > "Apple" ]]; then echo apple; else echo Apple; fi)
apple
$ (LC_COLLATE=en_US; if [[ "apple" > "Apple" ]]; then echo apple; else echo Apple; fi)
Apple

The difference is that the Bash-specific [[ ]] test compares strings using the current locale's collation rules, whereas the [ ] test compares them using ASCII ordering.

From the Bash man page:

When used with [[, the < and > operators sort lexicographically using the current locale.

When used with test or [, the < and > operators sort lexicographically using ASCII ordering.
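
Applied to the original problem, this could look like the sketch below. The function name `last_in_ascii_order` is hypothetical and not part of the question or this answer. It sets LC_ALL rather than LC_COLLATE because LC_ALL, when set, overrides LC_COLLATE; the function body is a subshell, so the change does not leak into the rest of the script.

```shell
# Hypothetical helper: print whichever of two names sorts last in
# byte (ASCII) order, while still using the [[ ]] syntax.
last_in_ascii_order() (
    LC_ALL=C    # affects only this subshell; overrides LC_COLLATE and LANG
    if [[ "$1" > "$2" ]]; then
        printf '%s\n' "$1"
    else
        printf '%s\n' "$2"
    fi
)

last_in_ascii_order "apple.dat" "Apple.dat"              # prints apple.dat
printf 'Apple\napple\n' | LC_ALL=C sort | tail -n1       # prints apple
```

Prefixing a single command with LC_ALL=C, as in the sort pipeline, changes the locale for that command only.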

John Kugelman
Léa Gris
  • Thank you, that is more concise than using LANG : ) – Astral Axiom Feb 21 '21 at 04:13
  • Thanks for that! Before posting the question, I had searched the man page for alphabetical and not found anything useful. Should have searched for lexicographical instead : ) Thanks so much to both of you, Gordon Davisson and @LéaGris, for all your help. – Astral Axiom Feb 21 '21 at 06:13

I have come up with my own solution to the problem. First, however, I must thank @GordonDavisson and @LéaGris for their help and for what I have learned from them, as that is invaluable to me.

Whether a computer or a human locale is used, the ordering is consistent: if apple comes after Apple in an alphabetical sort, then it also comes after Banana; and if Banana comes after apple, then Apple comes after apple. So I have come up with the following:

# A function which sorts two words alphabetically with lower case coming after upper case.
# The last word in the sort will be printed twice to demonstrate that this works for both
# the POSIX compliant single bracket test call and the newer double bracket condition
# syntax.
# arg 1: One of two words to sort
# arg 2: One of two words to sort
# Return: 0 upon completion, 1 if incorrect number of args is given
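sort_alphabetically() {
    [ $# -ne 2 ] && return 1

    local word_1_val=0
    local word_2_val=0
    local letter

    # Sum the character codes of the first word, one character at a time.
    while IFS= read -r -n1 letter; do
        (( word_1_val += $(printf '%d' "'$letter") ))
    done < <(echo -n "$1")

    # Sum the character codes of the second word.
    while IFS= read -r -n1 letter; do
        (( word_2_val += $(printf '%d' "'$letter") ))
    done < <(echo -n "$2")

    # POSIX-style single bracket test.
    if [ "$word_1_val" -gt "$word_2_val" ]; then
        echo "$1"
    else
        echo "$2"
    fi

    # Bash double bracket condition.
    if [[ $word_1_val -gt $word_2_val ]]; then
        echo "$1"
    else
        echo "$2"
    fi

    return 0
}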

sort_alphabetically "apple" "Apple"
sort_alphabetically "Banana" "apple"
sort_alphabetically "aPPle" "applE"

prints:

apple
apple
Banana
Banana
applE
applE

This works by using process substitution to feed the string into the while loop, which reads it one character at a time, and then using printf to get the decimal ASCII value of each character. It is like creating a temporary file from the string, which is destroyed automatically, and then reading that file one character at a time. The -n option stops echo from appending a trailing newline, so no extra newline character is fed into the loop.

From bash man pages:

Process Substitution

Process substitution allows a process's input or output to be referred to using a filename. It takes the form of <(list) or >(list). The process list is run asynchronously, and its input or output appears as a filename. This filename is passed as an argument to the current command as the result of the expansion. If the >(list) form is used, writing to the file will provide input for list. If the <(list) form is used, the file passed as an argument should be read to obtain the output of list. Process substitution is supported on systems that support named pipes (FIFOs) or the /dev/fd method of naming open files.

When available, process substitution is performed simultaneously with parameter and variable expansion, command substitution, and arithmetic expansion.

From a Stack Overflow post about printf:

If the leading character is a single-quote or double-quote, the value shall be the numeric value in the underlying codeset of the character following the single-quote or double-quote.

Note: process substitution is not POSIX compliant, but it is supported by Bash in the way stated in the bash man page.
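
As a minimal sketch of the two pieces described above working together (the function name `print_codes` is only illustrative), the loop below reads a string one character at a time through process substitution and prints each character with its numeric value:

```shell
# Illustrative helper: print each character of a word with its character code.
print_codes() {
    local ch
    while IFS= read -r -n1 ch; do
        # A leading single quote makes printf output the code of the next character.
        printf '%s -> %d\n' "$ch" "'$ch"
    done < <(printf '%s' "$1")
}

print_codes "Ab"
# A -> 65
# b -> 98
```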


UPDATE: The above does not work in all cases!


The above solution works in many cases; however, we get some anomalies.

| first word | second word | last alphabetically | result |
|---|---|---|---|
| apple | Apple | apple | correct |
| Apple | apple | apple | correct |
| apPLE | Apple | Apple | incorrect |
| apple | Banana | Banana | correct |
| apple | BANANA | apple | incorrect |
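
The anomalies come from the fact that a sum of character codes ignores the position of each character. A small sketch (the helper `code_sum` is hypothetical, written only to show the numbers) makes this visible:

```shell
# Hypothetical helper: sum the character codes of a word.
code_sum() {
    local word=$1 sum=0 i
    for (( i = 0; i < ${#word}; i++ )); do
        (( sum += $(printf '%d' "'${word:i:1}") ))
    done
    printf '%d\n' "$sum"
}

code_sum "apPLE"    # 434
code_sum "Apple"    # 498, so Apple wins although it starts with an upper case letter
code_sum "apple"    # 530
code_sum "BANANA"   # 417, so apple wins over BANANA
code_sum "ab"       # 195
code_sum "ba"       # 195, the same sum for a different word
```

More upper case letters lower the sum regardless of where they appear, and two different words can share a sum, so the total cannot stand in for a character-by-character comparison.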

The following solution gets the results that are needed:

#!/bin/bash

sort_alphabetically() {
    [ $# -ne 2 ] && return 1

    local WORD_1="$1"
    local WORD_2="$2"
    local WORD_1_LOWERED WORD_2_LOWERED
    WORD_1_LOWERED="$(echo -n "$1" | tr '[:upper:]' '[:lower:]')"
    WORD_2_LOWERED="$(echo -n "$2" | tr '[:upper:]' '[:lower:]')"

    # WORD_1 is a candidate if it sorts last either as given or lowercased.
    # Note that sort follows the current locale.
    if [ "$(echo -e "$WORD_1\n$WORD_2" | sort | tail -n1)" = "$WORD_1" ] ||\
       [ "$(echo -e "$WORD_1_LOWERED\n$WORD_2_LOWERED" | sort | tail -n1)" =\
         "$WORD_1_LOWERED" ]; then

        if [ "$WORD_1_LOWERED" = "$WORD_2_LOWERED" ]; then
            # Same letters, different case: break the tie using the summed
            # character codes and the first characters.
            local ASCII_VAL_WORD_1=0
            local ASCII_VAL_WORD_2=0
            local FIRST_CHAR_1 FIRST_CHAR_2 character
            read -r -n1 FIRST_CHAR_1 < <(echo -n "$WORD_1")
            read -r -n1 FIRST_CHAR_2 < <(echo -n "$WORD_2")

            while IFS= read -r -n1 character; do
                (( ASCII_VAL_WORD_1 += $(printf '%d' "'$character") ))
            done < <(echo -n "$WORD_1")

            while IFS= read -r -n1 character; do
                (( ASCII_VAL_WORD_2 += $(printf '%d' "'$character") ))
            done < <(echo -n "$WORD_2")

            if [ "$ASCII_VAL_WORD_1" -gt "$ASCII_VAL_WORD_2" ] &&\
               [ "$FIRST_CHAR_1" \> "$FIRST_CHAR_2" ]; then

                echo "$WORD_1"
            elif [ "$ASCII_VAL_WORD_2" -gt "$ASCII_VAL_WORD_1" ] &&\
                 [ "$FIRST_CHAR_2" \> "$FIRST_CHAR_1" ]; then

                echo "$WORD_2"
            elif [ "$FIRST_CHAR_1" \> "$FIRST_CHAR_2" ]; then
                echo "$WORD_1"
            else
                echo "$WORD_2"
            fi
        else
            echo "$WORD_1"
        fi
    else
        echo "$WORD_2"
    fi

    return 0
}

sort_alphabetically "apple" "Apple"
sort_alphabetically "Apple" "apple"
sort_alphabetically "apPLE" "Apple"
sort_alphabetically "Apple" "apPLE"
sort_alphabetically "apple" "Banana"
sort_alphabetically "apple" "BANANA"

exit 0

prints:

apple
apple
apPLE
apPLE
Banana
BANANA
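
For comparison, here is a shorter sketch that gives the same six results. It is not the function above but a hypothetical alternative (`last_alphabetically` is an invented name), and it assumes Bash 4 or later for the `${var,,}` lowercase expansion. It compares the lowercased words in byte order and falls back to the original spelling only when they tie; for inputs other than the six shown it may choose differently from the function above.

```shell
# Hypothetical alternative: case-insensitive comparison first,
# byte order of the original spelling as the tie-breaker.
last_alphabetically() (
    LC_ALL=C                  # byte-order comparisons, only inside this subshell
    a_lower=${1,,}            # Bash 4+ lowercase expansion
    b_lower=${2,,}
    if [[ "$a_lower" > "$b_lower" ]]; then
        printf '%s\n' "$1"
    elif [[ "$a_lower" < "$b_lower" ]]; then
        printf '%s\n' "$2"
    elif [[ "$1" > "$2" ]]; then
        printf '%s\n' "$1"    # same letters: lower case sorts after upper case
    else
        printf '%s\n' "$2"
    fi
)

last_alphabetically "apPLE" "Apple"     # prints apPLE
last_alphabetically "apple" "BANANA"    # prints BANANA
```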

Change your syntax. if [[ "Apple" -gt "apple" ]] works as expected.

wjandrea
Dennis
  • Oh, I was under the impression that the -gt, -eq type operators are only for numerical comparisons. – Astral Axiom Feb 21 '21 at 14:34
  • Just because comparing strings as integers gives you the desired result in this case doesn't mean it "works as expected" ... that's an accident. – tink Feb 26 '21 at 17:02
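
The objection in the comments can be checked with a short sketch. Inside [[ ]], the operands of -gt are evaluated as arithmetic expressions, so a bare word is treated as a variable name and an unset variable counts as 0. The helper name `gt_result` is only illustrative.

```shell
# Illustrative helper: report whether [[ $1 -gt $2 ]] is true.
gt_result() {
    if [[ "$1" -gt "$2" ]]; then echo yes; else echo no; fi
}

unset Apple apple            # make sure no variables with these names exist
gt_result "Apple" "apple"    # no  (0 -gt 0)
gt_result "apple" "Apple"    # no  (0 -gt 0 again, so the strings are never compared)

Apple=5                      # now the word "Apple" evaluates to 5
gt_result "Apple" "apple"    # yes (5 -gt 0)
```

Both orders give the same answer while the variables are unset, which is why the apparently correct result is an accident rather than a string comparison.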