10

I would like to sort alphanumeric strings the way a human being would sort them. I.e., "A2" comes before "A10", and "a" certainly comes before "Z"! Is there any way to do with without writing a mini-parser? Ideally it would also put "A1B1" before "A1B10". I see the question "Natural (human alpha-numeric) sort in Microsoft SQL 2005" with a possible answer, but it uses various library functions, as does "Sorting Strings for Humans with IComparer".

Below is a test case that currently fails:

#include <set>
#include <iterator>
#include <iostream>
#include <vector>
#include <cassert>

template <typename T>
struct LexicographicSort {
  inline bool operator() (const T& lhs, const T& rhs) const{
    std::ostringstream s1,s2;
    s1 << toLower(lhs); s2 << toLower(rhs);
    bool less = s1.str() < s2.str();
    //Answer: bool less = doj::alphanum_less<std::string>()(s1.str(), s2.str());
    std::cout<<s1.str()<<" "<<s2.str()<<" "<<less<<"\n";
    return less;
  }

  inline std::string toLower(const std::string& str) const {
    std::string newString("");
    for (std::string::const_iterator charIt = str.begin();
         charIt!=str.end();++charIt) {
          newString.push_back(std::tolower(*charIt));
        }
        return newString;
      }
};


int main(void) {
  const std::string reference[5] = {"ab","B","c1","c2","c10"};
  std::vector<std::string> referenceStrings(&(reference[0]), &(reference[5]));

  //Insert in reverse order so we know they get sorted
  std::set<std::string,LexicographicSort<std::string> > strings(referenceStrings.rbegin(), referenceStrings.rend());

  std::cout<<"Items:\n";
  std::copy(strings.begin(), strings.end(), std::ostream_iterator<std::string>(std::cout, "\n"));
  std::vector<std::string> sortedStrings(strings.begin(), strings.end());
  assert(sortedStrings == referenceStrings);
}
Community
  • 1
  • 1
Walter Nissen
  • 16,451
  • 4
  • 26
  • 27
  • Is there a reason you are using a `set` and not just `sort`-ing a `vector`? – John Dibling May 06 '10 at 19:32
  • 3
    First, how would A1B2 sort relative to A2B1? I've never done this, but I would probably start by breaking your string into chunks. Text, Numbers, Text, Numbers, et cetera. Then, sort the same way you would any other data structure with multiple members, with the understanding that the numeric bits sort as numbers not as strings. – Dennis Zickefoose May 06 '10 at 19:34
  • @Dibling: No particular reason. @Zickefoose: I would sort (ascending) as: A1B2, A1B10, A2B1. I think you may well be right that I'll have to do some primitive lexing, but I'd prefer to avoid something error prone if I can help it. – Walter Nissen May 06 '10 at 19:44
  • It looks like that's exactly what the ICompare article suggests. Understand that you're going to have to do at least a rudimentary parse of the strings. A character by character comparison fails specifically because with numbers you need to scan ahead to the end of the number to know what the value is. – Dennis Zickefoose May 06 '10 at 19:51
  • What about `A10B2` and `AB4C`? In the question you are referring to, there is a strict format, but you appear to be asking about completely freeform strings? Any number is less than any character? – UncleBens May 06 '10 at 19:54
  • Related question: http://stackoverflow.com/questions/34518/natural-sorting-algorithm – Emile Cormier May 06 '10 at 19:55
  • 1
    Also check out: http://www.codinghorror.com/blog/2007/12/sorting-for-humans-natural-sort-order.html – Emile Cormier May 06 '10 at 19:55
  • `natural_compare` would be a great addition to Boost's string algorithm library: http://www.boost.org/doc/libs/release/doc/html/string_algo.html – Emile Cormier May 06 '10 at 19:57
  • 3
    Unfortunately, this sort of comparison is error prone. When humans sort things, there's a lot of context sensitive information that a general purpose routine won't capture. When sorting usernames, for instance, numbers might be ignored outright. When looking at dollar amounts, 1.50-2 is probably two numbers [1.50] and [2], but when looking at software revisions it would just be one "number" [1.50-2]. – Dennis Zickefoose May 06 '10 at 20:00
  • 1
    Possible duplicate: http://stackoverflow.com/questions/642213/how-to-implement-a-natural-sort-algorithm-in-c – Eugen Constantin Dinca May 06 '10 at 20:02
  • [Here's a solution that might work](http://stackoverflow.com/a/33880554/3744681) – Jahid Nov 23 '15 at 21:02

4 Answers4

5

Is there any way to do with without writing a mini-parser?

Let someone else do that?

I'm using this implementation: http://www.davekoelle.com/alphanum.html, I've modified it to support wchar_t, too.

peterchen
  • 40,917
  • 20
  • 104
  • 186
  • Alright! Exactly what I was looking for, once I added case insensitivity. Replace the calculation of "less" above with 'bool less = doj::alphanum_less()(s1.str(), s2.str());' Thank you! – Walter Nissen May 06 '10 at 22:47
  • I used the exact same link to implement natural sort in Python, much easier though since Python integral are as big as one needs :) – Matthieu M. May 07 '10 at 06:56
  • Generally, links to documentation, function, a tool or library [should be accompanied by usage notes, a specific explanation of how the linked resource is applicable to the problem, or some sample code](http://meta.stackoverflow.com/a/251605/584192), or if possible all of the above. – Samuel Liew Feb 02 '19 at 02:19
2

It really depends what you mean by "parser." If you want to avoid writing a parser, I would think you should avail yourself of library functions.

  • Treat the string as a sequence of subsequences which are uniformly alphabetic, numeric, or "other."
  • Get the next alphanumeric sequence of each string using isalnum and backtrack-checking for + or - if it is a number. Use strtold in-place to find the end of a numeric subsequence.
  • If one is numeric and one is alphabetic, the string with the numeric subsequence comes first.
  • If one string has run out of characters, it comes first.
  • Use strcoll to compare alphabetic subsequences within the current locale.
  • Use strtold to compare numeric subsequences within the current locale.
  • Repeat until finished with one or both strings.
  • Break ties with strcmp.

This algorithm has something of a weakness in comparing numeric strings which exceed the precision of long double.

Potatoswatter
  • 134,909
  • 25
  • 265
  • 421
0

Is there any way to do it without writing a mini parser? I would think the answer is no. But writing a parser isn't that tough. I had to do this a while ago to sort our company's stock numbers. Basically just scan the number and turn it into an array. Check the "type" of every character: alpha, number, maybe you have others you need to deal with special. Like I had to treat hyphens special because we wanted A-B-C to sort before AB-A. Then start peeling off characters. As long as they are the same type as the first character, they go into the same bucket. Once the type changes, you start putting them in a different bucket. Then you also need a compare function that compares bucket-by-bucket. When both buckets are alpha, you just do a normal alpha compare. When both are digits, convert both to integer and do an integer compare, or pad the shorter to the length of the longer or something equivalent. When they're different types, you'll need a rule for how those compare, like does A-A come before or after A-1 ?

It's not a trivial job and you have to come up with rules for all the odd cases that may arise, but I would think you could get it together in a few hours of work.

Jay
  • 26,876
  • 10
  • 61
  • 112
0

Without any parsing, there's no way to compare human written numbers (high values first with leading zeroes stripped) and normal characters as part of the same string.

The parsing doesn't need to be terribly complex though. A simple hash table to deal with things like case sensitivity and stripping special characters ('A'='a'=1,'B'='b'='2,... or 'A'=1,'a'=2,'B'=3,..., '-'=0(strip)), remap your string to an array of the hashed values, then truncate number cases (if a number is encountered and the last character was a number, multiply the last number by ten and add the current value to it).

From there, sort as normal.

pdehaan
  • 332
  • 2
  • 13