Case insensitive sorting of an array of strings

Question

Basically, I have to use selection sort to sort a string[]. I have done this part but this is what I am having difficulty with.

The sort, however, should be case-insensitive, so that "antenna" would come before "Jupiter". ASCII sorts from uppercase to lowercase, so would there not be a way to just swap the order of the sorted string? Or is there a simpler solution?

void stringSort(string array[], int size) {
    int startScan, minIndex;
    string minValue;

    for(startScan = 0 ; startScan < (size - 1); startScan++) {
        minIndex = startScan;
        minValue = array[startScan];

        for (int index = startScan + 1; index < size; index++) {
            if (array[index] < minValue) {
                minValue = array[index];
                minIndex = index;
            }
        }
        array[minIndex] = array[startScan];
        array[startScan] = minValue;
    }
}

@ElliottFrisch its a class assign so dont want to get in trouble — bsem, Oct 27 '15 at 23:35

Jonathan Mee · Accepted Answer · 2019-03-19T15:49:05.237

C++ provides you with sort which takes a comparison function. In your case with a vector<string> you'll be comparing two strings. The comparison function should return true if the first argument is smaller.

For our comparison function we'll want to find the first mismatched character between the strings after tolower has been applied. To do this we can use mismatch which takes a comparator between two characters returning true as long as they are equal:

const auto result = mismatch(lhs.cbegin(), lhs.cend(), rhs.cbegin(), rhs.cend(), [](const unsigned char lhs, const unsigned char rhs){return tolower(lhs) == tolower(rhs);});

To decide if the lhs is smaller than the rhs fed to mismatch we need to test 3 things:

Were the strings of unequal length
Was string lhs shorter
Or was the first mismatched char from lhs smaller than the first mismatched char from rhs

This evaluation can be performed by:

result.second != rhs.cend() && (result.first == lhs.cend() || tolower(*result.first) < tolower(*result.second));

Ultimately, we'll want to wrap this up in a lambda and plug it back into sort as our comparator:

sort(foo.begin(), foo.end(), [](const unsigned char lhs, const unsigned char rhs){
    const auto result = mismatch(lhs.cbegin(), lhs.cend(), rhs.cbegin(), rhs.cend(), [](const unsigned char lhs, const unsigned char rhs){return tolower(lhs) == tolower(rhs);});

    return result.second != rhs.cend() && (result.first == lhs.cend() || tolower(*result.first) < tolower(*result.second));
});

This will correctly sort vector<string> foo. You can see a live example here: http://ideone.com/BVgyD2

EDIT:

Just saw your question update. You can use sort with string array[] as well. You'll just need to call it like this: sort(array, std::next(array, size), ...

@BennySaxeman Yeah see my edit. I'd just do: `sort(array, array + length, [](const auto& lhs, const auto& rhs){ const auto result = mismatch(lhs.cbegin(), lhs.cend(), rhs.cbegin(), rhs.cend(), [](const auto& lhs, const auto& rhs){return tolower(lhs) == tolower(rhs);}); return result.second != rhs.cend() && (result.first == lhs.cend() || tolower(*result.first) < tolower(*result.second)); });` sorry for the run on line! Note the change in the first two parameters. — Jonathan Mee, Oct 28 '15 at 03:32
If efficiency does not matter (as in this case) convert two strings to lowercase and compare result by `operator<` would be significantly simpler. — Slava, Oct 28 '15 at 12:55
@Slava Yeah, I think this algorithm is the most efficient way to handle it. But because sort is O(n log(n)) it will do log(n) extra conversions, as opposed to creating a temporary array of lowercased strings. But this algorithm takes advantage of the fact that `mismatch` short circuts, so it only calls `tolower` on letters that need to be compared. It'll depend upon your input set but this may have the smallest number of `tolowers` *and* it avoids the creation of temporaries (other than the two characters being compared.) — Jonathan Mee, Oct 28 '15 at 14:41

score 3 · Answer 2 · answered Oct 09 '17 at 21:05

#include <algorithm>
#include <vector>
#include <string>

using namespace std;    

void CaseInsensitiveSort(vector<string>& strs)
{
    sort(
        begin(strs),
        end(strs),
        [](const string& str1, const string& str2){
            return lexicographical_compare(
                begin(str1), end(str1),
                begin(str2), end(str2),
                [](const char& char1, const char& char2) {
                    return tolower(char1) < tolower(char2);
                }
            );
        }
    );
}

While this code may answer the question, providing additional context regarding how and/or why it solves the problem would improve the answer's long-term value. — baikho, Oct 09 '17 at 21:25

score 2 · Answer 3 · edited Jun 28 '18 at 15:45

I use this lambda function to sort a vectors of strings:

std::sort(entries.begin(), entries.end(), [](const std::string& a, const std::string& b) -> bool {
        for (size_t c = 0; c < a.size() and c < b.size(); c++) {
            if (std::tolower(a[c]) != std::tolower(b[c]))
                return (std::tolower(a[c]) < std::tolower(b[c]));
        }
        return a.size() < b.size();
    });

score 1 · Answer 4 · edited May 23 '17 at 12:32

Instead of the < operator, use a case-insensitive string comparison function.

C89/C99 provide strcoll (string collate), which does a locale-aware string comparison. It's available in C++ as std::strcoll. In some (most?) locales, like en_CA.UTF-8, A and a (and all accented forms of either) are in the same equivalence class. I think strcoll only compares within an equivalence class as a tiebreak if the whole string is otherwise equal, which gives a very similar sort order to a case-insensitive compare. Collation (at least in English locales on GNU/Linux) ignores some characters (like [). So ls /usr/share | sort gives output like

acpi-support
adduser
ADM_scripts
aglfn
aisleriot

I pipe through sort because ls does its own sorting, which isn't quite the same as sort's locale-based sorting.

If you want to sort some user-input arbitrary strings into an order that the user will see directly, locale-aware string comparison is usually what you want. Strings that differ only in case or accents won't compare equal, so this won't work if you were using a stable sort and depending on case-differing strings to compare equal, but otherwise you get nice results. Depending on the use-case, nicer than plain case-insensitive comparison.

FreeBSD's strcoll was and maybe still is case sensitive for locales other than POSIX (ASCII). That forum post suggests that on most other systems it is not case senstive.

MSVC provides a _stricoll for case-insensitive collation, implying that its normal strcoll is case sensitive. However, this might just mean that the fallback to comparing within an equivalence class doesn't happen. Maybe someone can test the following example with MSVC.

// strcoll.c: show that these strings sort in a different order, depending on locale
#include <stdio.h>
#include <locale.h>

int main()
{
    // TODO: try some strings containing characters like '[' that strcoll ignores completely.
    const char * s[] = { "FooBar - abc", "Foobar - bcd", "FooBar - cde" };
#ifdef USE_LOCALE
    setlocale(LC_ALL, ""); // empty string means look at env vars
#endif
    strcoll(s[0], s[1]);
    strcoll(s[0], s[2]);
    strcoll(s[1], s[2]);
    return 0;
}

output of gcc -DUSE_LOCALE -Og strcoll.c && ltrace ./a.out (or run LANG=C ltrace a.out):

__libc_start_main(0x400586, 1, ...
setlocale(LC_ALL, "")                        = "en_CA.UTF-8"   # my env contains LANG=en_CA.UTF-8
strcoll("FooBar - abc", "Foobar - bcd")      = -1
strcoll("FooBar - abc", "FooBar - cde")      = -2
strcoll("Foobar - bcd", "FooBar - cde")      = -1
  # the three strings are in order
+++ exited (status 0) +++

with gcc -Og -UUSE_LOCALE strcoll.c && ltrace ./a.out:

__libc_start_main(0x400536, ...
# no setlocale, so current locale is C
strcoll("FooBar - abc", "Foobar - bcd")      = -32
strcoll("FooBar - abc", "FooBar - cde")      = -2
strcoll("Foobar - bcd", "FooBar - cde")      = 32   # s[1] should sort after s[2], so it's out of order
+++ exited (status 0) +++

POSIX.1-2001 provides strcasecmp. The POSIX spec says the results are "unspecified" for locales other than plain-ASCII, though, so I'm not sure whether common implementations handle utf-8 correctly or not.

See this post for portability issues with strcasecmp, e.g. to Windows. See other answers on that question for other C++ ways of doing case-insensitive string compares.

Once you have a case-insensitive comparison function, you can use it with other sort algorithms, like C standard lib qsort, or c++ std::sort, instead of writing your own O(n^2) selection-sort.

As b.buchhold's answer points out, doing a case-insensitive comparison on the fly might be slower than converting everything to lowercase once, and sorting an array of indices. The lowercase-version of each strings is needed many times. std::strxfrm will transform a string so that strcmp on the result will give the same result as strcoll on the original string.

Are you saying that `strcoll` is case insensitive? Cause I'd never given it much thought, but if that's the case... — Jonathan Mee, Oct 28 '15 at 03:36
@JonathanMee: I'm nowhere near an expert in locales. I've only used `C` (POSIX ASCII) and `en_CA.UTF-8` locales. I assume it's typical for locales that use the Roman alphabet to do case-insensitive collation, as well as sorting accented characters with their unaccented versions. If you want to sort some user-input arbitrary strings into an order that the user will see directly, locale-aware string comparison is usually what you want. Other uses (like as part of a data structure) might care about the bytes and need to use `memcmp` or similar. — Peter Cordes, Oct 28 '15 at 05:55
`strcoll` *is* case sensitive, per: http://en.cppreference.com/w/cpp/string/byte/strcoll "Within an equivalence class, lowercase characters collate before their uppercase equivalents." In my testing I've found this to be true. This additionally strikes down `strxfrm` which behaves as `strcoll`. Leaving this answer with only the unfortunately platform dependent implementations :( — Jonathan Mee, Oct 28 '15 at 11:25
@JonathanMee: Actually, I think case and accent-or-not (i.e. comparing within equivalence class) only comes into play as a tiebreak after for strings that are equal when considering all members of an equivalence class as equal. This is great for sorting, but might or might not be what you want when testing for equality. I agree that this answer doesn't manage to have any great specific recommendations, just some pointers to things to check on, and a link to the main "c++ case-insensitive string compare" question. — Peter Cordes, Oct 28 '15 at 12:41

score 0 · Answer 5 · answered Oct 27 '15 at 22:58

You could call tolower on every character you compare. This is probably the easiest, yet not a great solution, becasue:

You look at every char multiple times so you'd call the method more often than necessary
You need extra care to handle wide-characters w.r.t to their encoding (UTF8 etc)

You could also replace the comparator by your own function. I.e. there will be some place where you compare something like stringone[i] < stringtwo[j] or charA < charB. change it to my_less_than(stringone[i], stringtwo[j]) and implement the exact ordering you want based.

another way would be to transform every string to lowercase once and create an array of pairs. then you base your comparisons on the lowercase value only, but you swap whole pairs so that your final strings will be in the right order as well.

finally, you can create an array with lowercase versions and sort this one. whenever you swap two elements in this one, you also swap in the original array.

note that all those proposals would still need proper handling of wide characters (if you need that at all)

Or add a layer of indirection while sorting: sort an array of indices by the `string` they index in your input array. Then undo the layer of indirection to get a sorted list of strings. That has lower swap overhead, but maybe slightly higher access overhead. — Peter Cordes, Oct 28 '15 at 01:45

Slava · Answer 6 · 2015-10-28T13:22:03.823

This solution is much simpler to understand than Jonathan Mee's and pretty inefficient, but for educational purpose could be fine:

std::string lowercase( std::string s )
{
   std::transform( s.begin(), s.end(), s.begin(), ::tolower );
   return s;
}

std::sort( array, array + length, 
           []( const std::string &s1, const std::string &s2 ) { 
               return lowercase( s1 ) < lowercase( s2 ); 
           } );

if you have to use your sort function, you can use the same approach:

    ....
    minValue = lowercase( array[startScan] );

    for (int index = startScan + 1; index < size; index++) {
        const std::string &tstr = lowercase( array[index] );
        if (tstr < minValue) {
            minValue = tstr;
            minIndex = index;
        }
    }
    ...

Case insensitive sorting of an array of strings

6 Answers6