
When std::views::split() is given an unnamed string literal as the pattern, it does not split the string, but it works just fine with an unnamed character literal.

#include <iomanip>
#include <iostream>
#include <ranges>
#include <string>
#include <string_view>

int main(void)
{
    using namespace std::literals;

    // yields the original string as a single piece (not split)
    auto splittedWords1 = std::views::split("one:.:two:.:three", ":.:");
    for (const auto word : splittedWords1)
        std::cout << std::quoted(std::string_view(word));
    
    std::cout << std::endl;

    // splits the string as expected
    auto splittedWords2 = std::views::split("one:.:two:.:three", ":.:"sv);
    for (const auto word : splittedWords2)
        std::cout << std::quoted(std::string_view(word));
    
    std::cout << std::endl;

    // splits the string as expected
    auto splittedWords3 = std::views::split("one:two:three", ':');
    for (const auto word : splittedWords3)
        std::cout << std::quoted(std::string_view(word));
    
    std::cout << std::endl;

    // yields the original string as a single piece (not split)
    auto splittedWords4 = std::views::split("one:two:three", ":");
    for (const auto word : splittedWords4)
        std::cout << std::quoted(std::string_view(word));
    
    std::cout << std::endl;

    return 0;
}

See live @ godbolt.org.

I understand that string literals are always lvalues. Even so, I am missing an important piece of information that connects everything together. Why can I pass the string that I want split as an unnamed string literal, whereas it fails (as in: returns a range of ranges containing the original string) when I do the same with the pattern?

khaos

2 Answers


String literals always end with a null terminator, so ":.:" is actually a range of four elements whose last element is \0.

Since the original string does not contain such a pattern, it is not split.

When dealing with C++20 ranges, I strongly recommend using string_view instead of raw string literals: it works well with <ranges> and avoids the error-prone null-terminator issue.

康桓瑋
  • I can't express how much I hate that this is right and that the library can't handle this well without additional language support. – chris Oct 31 '22 at 08:11
  • I was hoping for a really complicated reason, not one that's completely obvious but only after you know the answer. – Retired Ninja Oct 31 '22 at 08:22
  • Explains my observation with `std::string` versus `std::string_view` for the separator. – Some programmer dude Oct 31 '22 at 09:01
  • Thanks! Like others have said, this is really sad. I actually had this in the back of my mind but considered it a too obvious common pitfall that would have never made it in the standard. I was clearly wrong. How did you come to know this or figure it out? It is my nature that I always have to understand how things work (which is actually not such a good trait to have). Any tips how one would approach a problem like the above to figure out what is going on? Debugging/decompiling STL code isn't very pleasant. – khaos Nov 01 '22 at 07:29
  • @chris It kinda makes perfect sense to me though. Considering the working examples are present and this is a new feature - I don't see the problem. – AnArrayOfFunctions Nov 01 '22 at 08:44
  • @AnArrayOfFunctions, I don't want to end up with this creating a bunch of noise, but my issue with it is that it's really easy to write a pattern that has an intuitive meaning (e.g., `":"` meaning to split on colons like every other language) and could reasonably pass code review unless someone's specifically looking for this, but in reality has a completely different meaning that you'd almost never want (split on colon followed by \0). Writing a string literal as a pattern for a split call is a common use case that needs extra decoration and will silently do the wrong thing without any. – chris Nov 02 '22 at 03:15
  • As a final note, my saving grace here is that this is a perfect clang-tidy check. – chris Nov 02 '22 at 03:17
  • @chris Ok so that's a `std::views` not an `std::string_view` - C++ string literals have always been arrays with an additional `\0` element (I think you can't even assign them to an array of fewer characters like in C - https://stackoverflow.com/questions/21407898/initializer-string-for-array-of-chars-is-too-long-error). But bottom line, C++ strings are what they are, and if you want to use a standard library string - just https://en.cppreference.com/w/cpp/string/basic_string/operator%22%22s – AnArrayOfFunctions Nov 02 '22 at 08:04
  • I feel like if you don't like it you should just learn it and if you don't want to learn it or you feel it's not correct you should just switch language - maybe use JavaScript. – AnArrayOfFunctions Nov 02 '22 at 08:10
  • I personally don't use the `s` literal but instead simply use `std::string` whenever possible (you would likely not do an `std::views::split` on a string literal but rather on an object - make that object `std::string` or something). – AnArrayOfFunctions Nov 02 '22 at 08:14
  • On a second note I just checked - maybe the original question wasn't the best example - if the first argument (aka the string being split) is an `std::string` - it doesn't seem to work either - which invalidates my case - but still this should be the only scenario fixed. – AnArrayOfFunctions Nov 02 '22 at 08:19
  • Not to extend this thread much but I would also add that `std::ranges::iota_view` also has a similar problem where it requires the 2 arguments to be of the same type but doesn't do type conversion so it just errors out (if one is let's say an `int` and the other `size_t`). Or maybe I should start using JavaScript as well - oh well. – AnArrayOfFunctions Nov 02 '22 at 08:25

This answer is completely correct; I'd just like to add a couple of additional notes that might be interesting.


First, if you use {fmt} for printing, it's a lot easier to see what's going on, since you also don't have to write your own loop. You can just write this:

fmt::print("{}\n", std::views::split("one:.:two:.:three", ":.:"));

Which will output (this is the default output for a range of ranges of char):

[[o, n, e, :, ., :, t, w, o, :, ., :, t, h, r, e, e, ]]

In C++23, there will be a way to directly specify that this print as a range of strings, but that hasn't been added to {fmt} yet. In the meantime, because split preserves the initial range category, you can add:

auto to_string_views = std::views::transform([](auto sr){
    return std::string_view(sr.data(), sr.size());
});

And then:

fmt::print("{}\n", std::views::split("one:.:two:.:three", ":.:") | to_string_views);

prints:

["one:.:two:.:three\x00"]

Note the visible trailing zero. Likewise, the next three attempts format as:

["one", "two", "three\x00"]
["one", "two", "three\x00"]
["one:two:three\x00"]

The fact that we can clearly see the \x00 helps track down the issue.
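
If you would rather stay with the standard library only, a small escaping helper gives a similar effect. This is just a sketch: `escape_nonprintable` is a made-up name, and it assumes plain ASCII text. Note that std::quoted writes a null byte through unchanged, so most terminals will not show it.

```cpp
#include <cstdio>
#include <string>
#include <string_view>

// Hypothetical helper: returns a copy of `s` in which every character
// outside the printable ASCII range is rendered as \xNN.
std::string escape_nonprintable(std::string_view s)
{
    std::string out;
    for (unsigned char c : s) {
        if (c >= 0x20 && c < 0x7f) {
            out += static_cast<char>(c);
        } else {
            char buf[5];  // backslash, 'x', two hex digits, terminator
            std::snprintf(buf, sizeof buf, "\\x%02x", static_cast<unsigned>(c));
            out += buf;
        }
    }
    return out;
}
```

Printing each piece through such a helper, instead of through std::quoted, would make the trailing terminator show up as \x00 in the output.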


Next, consider the difference between:

std::views::split("one:.:two:.:three", ":.:")

and

"one:.:two:.:three" | std::views::split(":.:")

We typically consider these to be equivalent, but they're... not entirely. In the latter case, the library has to capture and stash these values - which involves decaying them. In this case, because ":.:" decays into char const*, that's no longer a valid pattern for the incoming string literal. So the above doesn't actually compile.

Now, it'd be great if it both compiled and also worked correctly. Unfortunately, it's impossible to tell in the language between a string literal (where you don't want to include the null terminator) and an array of char (where you want to include the whole array). So at least, with this latter formulation, you can get the wrong thing to not compile. And at least - "doesn't compile" is better than "compiles and does something wildly different from what I expected"?


Demo.

Barry
  • I kept fmtlib out of the question since I wanted to keep it as vanilla as possible. :-) But your remarks have been very insightful! Would you mind further explaining why your last example does not compile? Shouldn't both string literals end up as the same type? And is there a performance/memory penalty when using pipes instead of the single function call? – khaos Nov 01 '22 at 07:37
  • @khaos `std::views::split(":.:")` decays the literal, so it gets stored as a `char const*`. That's no longer a valid pattern for the incoming `char const[18]` - since it's neither comparable to one element (`char`) nor is it a range (`char const*` isn't a range at all). When you do `split("abc", "b")` directly, you don't have this extra step needed to produce the partial split object, so `"b"` gets stashed directly (as a `ref_view` to `char const(&)[2]`). It's just a nuance of how this needs to be implemented in the library machinery. – Barry Nov 01 '22 at 16:33