0

I see Compare two string as numeric value and I can look at the source for the sort command-line utility. But to my surprise, I haven't found in Boost or StackOverflow a reference implementation of a generic string comparison of what man sort calls --version-sort – sorting strings with embedded numbers. I understand there are locale issues if we want to do full numeric sort with signs, separators, and decimals, but version sort seems much easier (and addresses the common case sequential file names that aren't zero-padded). Is there a common implementation of something like this?

If not, A naive generic implementation supporting just unsigned integers seems like it would be

  1. Both empty => equal.
  2. Find the first digits in each range.
  3. std::lexicographical_compare_three_way the corresponding leading ranges.
  4. If they aren't equal, we are done.
  5. If they are equal, find the end of the digit spans.
  6. Compare the corresponding digit spans numerically.
  7. Recursively compare the remainder of the ranges.

Where the digit-span portion is something like

#include <algorithm>
#include <cassert>
#include <cctype> // For std::isdigit
#include <string_view>


struct is_digit_t {
    // See https://en.cppreference.com/w/cpp/string/byte/isdigit
    [[nodiscard]] bool operator()(unsigned char c) const {
         return std::isdigit(c);
    }
    [[nodiscard]] bool operator()(const std::integral auto& c) const {
        if (c != static_cast<std::decay_t<decltype(c)>>(static_cast<unsigned char>(c))) {
            return false; // Has to be itself as an unsigned char.
        }
        return (*this)(static_cast<unsigned char>(c));
    }
};

template <typename It0, typename It1, typename IsDigit = is_digit_t>
std::strong_ordering compare_strings_containing_only_digits_three_way(
    It0 b0, It0 e0, 
    It1 b1, It1 e1,
    [[maybe_unused]] IsDigit is_digit = {})
{
    assert(std::all_of(b0, e0, is_digit));
    assert(std::all_of(b1, e1, is_digit));    
    // Skip leading zeros:
    auto nonzero = [](const auto& c) { return c != '0'; };
    const auto bnz0 = std::find_if_not(b0, e0, nonzero);
    const auto bnz1 = std::find_if_not(b1, e1, nonzero);
    const auto s0 = std::distance(bnz0, e0);
    const auto s1 = std::distance(bnz1, e1);
    if (s0 != s1) {
        return s0 < s1 ? std::strong_ordering::less : std::strong_ordering::greater;
    }
    // Same number of digits => lexicographical compare is numerical compare:
    const auto numerical = std::lexicographical_compare_three_way(b0, e0, b1, e1);
    if (numerical != 0) {
        return numerical;
    }
    // Tiebreaker: "1" < "01" < "001" < "2":
    return std::lexicographical_compare_three_way(b0, bnz0, b1, bnz1);
}

template <typename It0, typename It1, typename IsDigit = is_digit_t>
std::strong_ordering compare_strings_containing_unsigned_integers_three_way(
    It0 b0, It0 e0, 
    It1 b1, It1 e1,
    IsDigit is_digit = {}) 
{
    if (b0 == e0 && b1 == e1) { // Totally empty => equal.
        return std::strong_ordering::equal;
    }
    const auto numStart0 = std::find_if(b0, e0, is_digit);
    const auto numStart1 = std::find_if(b1, e1, is_digit);

    
    if (const auto leadingCmp = std::lexicographical_compare_three_way(b0, numStart0, b1, numStart1);
        leadingCmp != std::strong_ordering::equal) {
        return leadingCmp; // Don't have to worry about the numbers.
    }
    const auto numEnd0 = std::find_if_not(numStart0, e0, is_digit);
    const auto numEnd1 = std::find_if_not(numStart1, e1, is_digit);
    if (auto numberCmp = compare_strings_containing_only_digits_three_way(numStart0, numEnd0, numStart1, numEnd1, is_digit); 
        numberCmp != std::strong_ordering::equal) {
        return numberCmp; // Number-part tie-broke it.
    }
    // Recursively deal with everything after the first matched numbers:
    return compare_strings_containing_unsigned_integers_three_way(numEnd0, e0, numEnd1, e1);
}

https://godbolt.org/z/eEKbMj9vj

So...

  • Does this seem right?
  • Is there a cleaner way?
  • Are there really no Boost algorithms for this?
Ben
  • 9,184
  • 1
  • 43
  • 56

1 Answers1

2

std::sort requires O(N·log(N)) comparisons. Compare that to O(N) transforming each string individually to a number that can be compared via built-in <.

Instead of embedding the transformation in the comparison you should consider to populate a std::vector< std::pair<int, std::string>> and sort that via a simple std::sort(v.begin(),v.end()) and eventually extract the original strings to get the desired output.

463035818_is_not_an_ai
  • 109,796
  • 11
  • 89
  • 185
  • That approach requires O(N) extra allocs in an effort to optimize for the case that there are numbers in the strings that affect the ordering. Also, other algorithms require an ordering, such as `std::min_element`. – Ben Oct 25 '21 at 19:25
  • @Ben I wouldn't discount this suggestion just because of the additional space, unless you are working in a memory-constrained environment or with very large orders of `N`. If you parse in-place for the sorting, then you are performing this computation `N*log(N)`times, which would be quite inefficient over performing the transformation once -- even with the cost of extra storage. The storage size would be fixed and the size known up-front. Even if the alloc cost is really that worrying, polymorphic allocators can help (though really, profile first) – Human-Compiler Oct 25 '21 at 19:46