12

I'm currently sorting by the std::string < operator. The problem with it is that:

30 < 9. The 30 shows up before the 9 since 3 < 9, Windows 9x had this issue to. How could I go about sorting them numerically so that "30 Foxes" shows up after "9 dogs". I should also add that I'm using utf 8 encoding.

Thanks

jmasterx
  • 52,639
  • 96
  • 311
  • 557
  • possible duplicate of [How to implement a natural sort algorithm in c++?](http://stackoverflow.com/questions/642213/how-to-implement-a-natural-sort-algorithm-in-c) – ephemient Jan 07 '11 at 06:21
  • This depends on the use-case, of course, but for filenames I actually preferred the pre-XP way. I hate when the software tries to be overly smart, because then it starts to mess things up... try sorting a list of hexadecimal filenames in >=XP. – Yakov Galka Apr 28 '13 at 14:13

4 Answers4

13

You can create a custom comparison function to use with std::sort. This function would have to check if the string begins with a numeric value. If it does, convert the numeric part of each string to an int using some mechanism like a stringstream. Then compare the two integer values. If the values compare equally, compare the non-numeric part of the strings lexicographically. Otherwise, if the strings don't contain a numeric part, simply compare the two strings lexicographically as normal.

Basically, something like the following (untested) comparison function:

bool is_not_digit(char c)
{
    return !std::isdigit(c);
}

bool numeric_string_compare(const std::string& s1, const std::string& s2)
{
    // handle empty strings...

    std::string::const_iterator it1 = s1.begin(), it2 = s2.begin();

    if (std::isdigit(s1[0]) && std::isdigit(s2[0])) {
        int n1, n2;
        std::stringstream ss(s1);
        ss >> n1;
        ss.clear();
        ss.str(s2);
        ss >> n2;

        if (n1 != n2) return n1 < n2;

        it1 = std::find_if(s1.begin(), s1.end(), is_not_digit);
        it2 = std::find_if(s2.begin(), s2.end(), is_not_digit);
    }

    return std::lexicographical_compare(it1, s1.end(), it2, s2.end());
}

And then...

std::sort(string_array.begin(), string_array.end(), numeric_string_compare);

EDIT: Of course, this algorithm is only useful if you're sorting strings where the numeric portion appears at the beginning of the string. If you're dealing with strings where the numeric portion can appear anywhere in the string, then you need a more sophisticated algorithm. See http://www.davekoelle.com/alphanum.html for more information.

Charles Salvia
  • 52,325
  • 13
  • 128
  • 140
  • +1 because your code deals with the case when one of them contains a number at its start but the other doesn't. – Khaled Alshaya Jan 07 '11 at 04:51
  • `atoi` is ideal for this purpose, you won't even need special code for strings that don't start with numbers (although they will sort to the beginning). Or `strtod` if you want to control that. – Ben Voigt Jan 07 '11 at 06:15
  • 1
    @Ben, yeah, `atoi` would be much more efficient than `std::stringstream`. – Charles Salvia Jan 07 '11 at 06:17
  • FWIW, I wrote a similar function that uses `strtoll` for the [stackoverflow question here](http://stackoverflow.com/questions/22179430/sorting-std-vector-of-strings-without-using-default-algorithm/22180271#22180271), it's less correct in another way though - assuming direct character comparison is the desired order when not comparing digits. – Tony Delroy Mar 04 '14 at 18:49
  • Converting to int (e.g., via atoi) assumes the string representation of the number can be successfully converted to an int. With long strings of digits, the conversion would overflow... – Waxrat Dec 10 '14 at 20:47
3

Here's a version that doesn't convert to integer and thus works for long strings of digits regardless of sizeof(int).

#include <cctype>
#include <cstddef>
#include <cstring>
#include <string>

int numcmp(const char *a, const char *aend, const char *b, const char *bend)
{
  for (;;) {
    if (a == aend) {
      if (b == bend)
        return 0;
      return -1;
    }
    if (b == bend)
      return 1;
    if (*a == *b) {
      ++a, ++b;
      continue;
    }
    if (!isdigit((unsigned char) *a) || !isdigit((unsigned char) *b))
      return *a - *b;

    // skip leading zeros in both strings
    while (*a == '0' && ++a != aend)
      ;
    while (*b == '0' && ++b != aend)
      ;

    // skip to end of the consecutive digits
    const char *aa = a;
    while (a != aend && isdigit((unsigned char) *a))
      ++a;
    std::ptrdiff_t alen = a - aa;

    const char *bb = b;
    while (b != bend && isdigit((unsigned char) *b))
      ++b;
    std::ptrdiff_t blen = b - bb;

    if (alen != blen)
      return alen - blen;

    // same number of consecutive digits in both strings
    while (aa != a) {
      if (*aa != *bb)
        return *aa - *bb;
      ++aa, ++bb;
    }
  }
}

int numcmp(const std::string& a, const std::string& b)
{
  return numcmp(a.data(), a.data() + a.size(),
                b.data(), b.data() + b.size());
}

int numcmp(const char *a, const char *b)
{
  return numcmp(a, a + strlen(a), b, b + strlen(b));
}
Waxrat
  • 2,075
  • 15
  • 13
  • `numcmp("0001", "001") == 0`, which means when you use `numcmp` as a key sorting function in a set, only one of these strings will be stored in the set. This is probably not intended. – Roland Illig Jul 29 '21 at 07:10
  • Yes, indeed. For all solutions to this problem, we'd need to decide what we want to happen if the strings are different but numcmp says they're equal – Waxrat Jul 30 '21 at 08:45
3

If you are targeting Windows (XP+) and can afford to convert your strings to utf-16, you can use the StrCmpLogicalW function from Shlwapi. See msdn for details.

Otherwise, ICU provides this functionality in its collators. See UCOL_NUMERIC_COLLATION.

kbjorklu
  • 1,338
  • 8
  • 10
2

This what worked for me (assuming no leading zeroes), i.e. the idea is that phonetic compare can be applied just to numbers with same number of digits.

    auto numeric_str_cmp = [](const std::string& a, const std::string& b) -> bool {
                                return (a.size() < b.size()) || (a.size() == b.size() && a < b);
                             };
    std::sort(numeric_strings.begin(), numeric_strings.end(), numeric_str_cmp);