I am doing a lot of parsing/processing where leading/trailing whitespace should be ignored and comparisons should be case-insensitive. So I wrote a basic char trait for std::basic_string (see below) to save myself some work.

The trait is not working. The problem is that basic_string's compare calls the trait's compare and, if that returns 0, it returns the difference in sizes. basic_string.h says: "...If the result of the comparison is nonzero returns it, otherwise the shorter one is ordered first." Looks like they explicitly don't want me to do this...

What is the reason for having this additional "shorter one" ordering if trait's compare returns 0? And, is there any workaround or do I have to roll my own string?

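#include <algorithm>   // std::min
#include <cctype>      // std::isspace
#include <cstring>
#include <iostream>
#include <string>
#include <strings.h>   // strncasecmp (POSIX)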

namespace csi{
template<typename T>
struct char_traits : std::char_traits<T>
{
    static int compare(T const*s1, T const*s2, size_t n){
        size_t n1(n);
        while(n1>0&&std::isspace(*s1))
            ++s1, --n1;
        while(n1>0&&std::isspace(s1[n1-1]))
            --n1;
        size_t n2(n);
        while(n2>0&&std::isspace(*s2))
            ++s2, --n2;
        while(n2>0&&std::isspace(s2[n2-1]))
            --n2;
        return strncasecmp(static_cast<char const*>(s1),
                           static_cast<char const*>(s2),
                           std::min(n1,n2));
    }
};
using string = std::basic_string<char,char_traits<char>>;
}

int main()
{
    using namespace csi;
    string s1 = "hello";
    string s2 = " HElLo ";
    std::cout << std::boolalpha
              << "s1==s2" << " " << (s1==s2) << std::endl;
}
  • Not the answer, but `std::isspace(x)` where `x` is `char` *has* to be written as `std::isspace((unsigned char)x)`. Otherwise you get [undefined behavior for negative character codes](https://en.cppreference.com/w/cpp/string/byte/isspace). – HolyBlackCat Oct 28 '18 at 11:13
  • @HolyBlackCat Yes ofc, thank you :) –  Oct 28 '18 at 11:38

2 Answers

What is the reason for having this additional "shorter one" ordering if trait's compare returns 0?

That's simply how basic_string::compare() is defined.
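As a rough sketch of what that definition amounts to (the function name `compare_like_basic_string` is made up for illustration; this is not the library's source), the comparison hands the traits only the common prefix and settles a tie by size:

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Sketch of the algorithm specified for basic_string::compare(const basic_string&).
// compare_like_basic_string is a hypothetical name, not a library function.
template <typename CharT, typename Traits, typename Alloc>
int compare_like_basic_string(const std::basic_string<CharT, Traits, Alloc>& lhs,
                              const std::basic_string<CharT, Traits, Alloc>& rhs)
{
    // Only the common prefix is ever handed to the traits.
    const std::size_t rlen = std::min(lhs.size(), rhs.size());
    const int r = Traits::compare(lhs.data(), rhs.data(), rlen);
    if (r != 0)
        return r;

    // Tie on the prefix: the shorter string is ordered first.
    if (lhs.size() < rhs.size()) return -1;
    if (lhs.size() > rhs.size()) return 1;
    return 0;
}
```

Applied to the question's example, `"hello"` has size 5 and `" HElLo "` has size 7, so the trait is called with n = 5. It sees `"hello"` and `" HElL"`, trims the second to `"HElL"`, and returns 0 for the four characters it compares. The size difference (5 vs 7) then decides the result, which is why `s1==s2` is false.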

And, is there any workaround or do I have to roll my own string?

It seems that your custom char_traits have to implement:

  • length(), returning length of the trimmed part of a string, and

  • move() and copy(), for copying that trimmed part


However, there's a potential problem which cannot be solved using custom traits. basic_string has constructors like basic_string(const CharT* s, size_type count, const Allocator& alloc), and overloads of methods like assign or compare, which take a C string and its length. In those cases Traits::length() won't be called. If anyone uses one of those overloads, the string might contain trailing whitespace, or the call might try to access characters beyond the end of the source string.
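A small experiment can make this visible. The sketch below uses a hypothetical `counting_traits` (not part of any library) whose `length()` only counts its calls; the two helper functions are likewise invented for illustration:

```cpp
#include <cstddef>
#include <string>

// Hypothetical traits that only count calls to length(); everything else
// is inherited unchanged from std::char_traits<char>.
struct counting_traits : std::char_traits<char>
{
    static std::size_t& length_calls()
    {
        static std::size_t calls = 0;
        return calls;
    }

    static std::size_t length(const char* s)
    {
        ++length_calls();
        return std::char_traits<char>::length(s);
    }
};

using counting_string = std::basic_string<char, counting_traits>;

// Number of length() calls made while constructing from a C string alone.
std::size_t calls_for_cstring(const char* s)
{
    counting_traits::length_calls() = 0;
    counting_string str(s);
    return counting_traits::length_calls();
}

// Number of length() calls made when the caller also passes a count.
std::size_t calls_for_pointer_and_count(const char* s, std::size_t n)
{
    counting_traits::length_calls() = 0;
    counting_string str(s, n);
    return counting_traits::length_calls();
}
```

With the common standard library implementations, `calls_for_cstring("hi  ")` should report at least one call, while `calls_for_pointer_and_count("hi  ", 4)` should report none, so a trimming `length()` never gets a chance to drop the trailing spaces in the second case.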

To solve this, it's possible to do something like this:

class TrimmedString
{
public:
    // expose only "safe" methods:
    void assign(const char* s) { m_str.assign(s); }

private:
    std::basic_string<char, CustomTraits> m_str;
};

Or this (might be simpler):

class TrimmedString : private std::basic_string<char, CustomTraits>
{
public:
    using BaseClass = std::basic_string<char, CustomTraits>; // for readability

    // make "safe" methods public
    using BaseClass::length;
    using BaseClass::size;
    // etc.

    // wrappers for methods with "unsafe" overloads
    void assign(const char* s) { BaseClass::assign(s); }
};
joe_chip
  • Thank you for the answer. If `length()` returns the trimmed size, how would I deal with leading whitespace? E.g. `" hello "` -> [following your approach] -> `" hell"`. I don't see a way of incrementing `basic_string`'s data pointer via traits :/. Even if I accomplished it somehow via `length()`, I think I would mess up a lot of things like `basic_string`'s append etc., or not? –  Oct 28 '18 at 11:34
  • @OZ17 that's why you need to implement `copy` and `move` too. These methods would just copy the middle part, ignoring leading and trailing parts of the source. – joe_chip Oct 28 '18 at 11:39
  • I am struggling a bit with it... If I change `length()`, `copy()`, `move()` etc. inside the traits, I am going to have a lot of "unsafe" `basic_string` methods: `begin`, `insert`, `append`... I basically end up rewriting the `std::basic_string` class. –  Oct 28 '18 at 12:01
  • `begin()` is safe; the problem here is the methods which take a C string and a length, like some overloads of `assign` etc. The idea in my example was to wrap `std::basic_string` so that they are not public. This way, it's not necessary to implement them - you just need one-line wrappers which call methods from `basic_string`. – joe_chip Oct 28 '18 at 15:39

Converting data that has more than one possible representation into a "standard" or "normal" form is called canonicalization. With text it usually means removing accents, folding case, and trimming whitespace and/or format characters.

If canonicalization is done under the hood during each compare, it is fragile. For example, how do you test that it was done correctly to both s1 and s2? It is also inflexible: you cannot display its result or cache it for the next compare. So it is both more robust and more efficient to do it as an explicit canonicalization step.
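A minimal sketch of such an explicit step, assuming plain `std::string` and the "C" locale rules of `std::isspace`/`std::tolower` (the name `canonicalize` is invented for illustration):

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper: returns a trimmed, lower-cased copy of `in`.
std::string canonicalize(const std::string& in)
{
    // Taking unsigned char avoids undefined behaviour for negative char values.
    auto is_space = [](unsigned char c) { return std::isspace(c) != 0; };

    // First non-space character from the front ...
    auto first = std::find_if_not(in.begin(), in.end(), is_space);
    // ... and one past the last non-space character, searching from the back.
    auto last = std::find_if_not(in.rbegin(),
                                 std::string::const_reverse_iterator(first),
                                 is_space).base();

    std::string out(first, last);
    std::transform(out.begin(), out.end(), out.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return out;
}
```

With this, `canonicalize(" HElLo ") == canonicalize("hello")` is an ordinary `std::string` comparison, and the canonical form can be printed, stored, or reused for later comparisons.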

What is the reason for having this additional "shorter one" ordering if trait's compare returns 0?

The traits' compare is required to compare only n characters, and basic_string passes the length of the shorter string as n. So when you compare "hellow" and "hello", what should it return? It should return 0, because the first five characters match. Ignoring n and reading further is a defect, because the traits must also work with std::string_view, which is not zero-terminated. If the size comparison were dropped, "hellow" and "hello" would compare equal, which you likely don't want.

Öö Tiib
  • Thank you!!! Took me a while to think through this again... The only reasonable way my trait would have worked is if I could do this "shorter one" ordering inside the traits, with trimmed lengths instead. But it's certainly not doable, because I can't override `basic_string`'s compare method. On the other hand, touching the trait's `length()` is just too much of a mess. I'll stick to explicit trimming. Thank you again!!! –  Oct 28 '18 at 13:32