I see Compare two string as numeric value and I can look at the source for the sort
command-line utility. But to my surprise, I haven't found in Boost or StackOverflow a reference implementation of a generic string comparison of what man sort
calls --version-sort
– sorting strings with embedded numbers. I understand there are locale issues if we want to do full numeric sort with signs, separators, and decimals, but version sort seems much easier (and addresses the common case sequential file names that aren't zero-padded). Is there a common implementation of something like this?
If not, A naive generic implementation supporting just unsigned integers seems like it would be
- Both empty => equal.
- Find the first digits in each range.
std::lexicographical_compare_three_way
the corresponding leading ranges.- If they aren't equal, we are done.
- If they are equal, find the end of the digit spans.
- Compare the corresponding digit spans numerically.
- Recursively compare the remainder of the ranges.
Where the digit-span portion is something like
#include <algorithm>
#include <cassert>
#include <cctype> // For std::isdigit
#include <string_view>
struct is_digit_t {
// See https://en.cppreference.com/w/cpp/string/byte/isdigit
[[nodiscard]] bool operator()(unsigned char c) const {
return std::isdigit(c);
}
[[nodiscard]] bool operator()(const std::integral auto& c) const {
if (c != static_cast<std::decay_t<decltype(c)>>(static_cast<unsigned char>(c))) {
return false; // Has to be itself as an unsigned char.
}
return (*this)(static_cast<unsigned char>(c));
}
};
template <typename It0, typename It1, typename IsDigit = is_digit_t>
std::strong_ordering compare_strings_containing_only_digits_three_way(
It0 b0, It0 e0,
It1 b1, It1 e1,
[[maybe_unused]] IsDigit is_digit = {})
{
assert(std::all_of(b0, e0, is_digit));
assert(std::all_of(b1, e1, is_digit));
// Skip leading zeros:
auto nonzero = [](const auto& c) { return c != '0'; };
const auto bnz0 = std::find_if_not(b0, e0, nonzero);
const auto bnz1 = std::find_if_not(b1, e1, nonzero);
const auto s0 = std::distance(bnz0, e0);
const auto s1 = std::distance(bnz1, e1);
if (s0 != s1) {
return s0 < s1 ? std::strong_ordering::less : std::strong_ordering::greater;
}
// Same number of digits => lexicographical compare is numerical compare:
const auto numerical = std::lexicographical_compare_three_way(b0, e0, b1, e1);
if (numerical != 0) {
return numerical;
}
// Tiebreaker: "1" < "01" < "001" < "2":
return std::lexicographical_compare_three_way(b0, bnz0, b1, bnz1);
}
template <typename It0, typename It1, typename IsDigit = is_digit_t>
std::strong_ordering compare_strings_containing_unsigned_integers_three_way(
It0 b0, It0 e0,
It1 b1, It1 e1,
IsDigit is_digit = {})
{
if (b0 == e0 && b1 == e1) { // Totally empty => equal.
return std::strong_ordering::equal;
}
const auto numStart0 = std::find_if(b0, e0, is_digit);
const auto numStart1 = std::find_if(b1, e1, is_digit);
if (const auto leadingCmp = std::lexicographical_compare_three_way(b0, numStart0, b1, numStart1);
leadingCmp != std::strong_ordering::equal) {
return leadingCmp; // Don't have to worry about the numbers.
}
const auto numEnd0 = std::find_if_not(numStart0, e0, is_digit);
const auto numEnd1 = std::find_if_not(numStart1, e1, is_digit);
if (auto numberCmp = compare_strings_containing_only_digits_three_way(numStart0, numEnd0, numStart1, numEnd1, is_digit);
numberCmp != std::strong_ordering::equal) {
return numberCmp; // Number-part tie-broke it.
}
// Recursively deal with everything after the first matched numbers:
return compare_strings_containing_unsigned_integers_three_way(numEnd0, e0, numEnd1, e1);
}
https://godbolt.org/z/eEKbMj9vj
So...
- Does this seem right?
- Is there a cleaner way?
- Are there really no Boost algorithms for this?