
The title is pretty much it. If a standard C++ string with UTF-8 characters has no zero bytes, does the scanning terminate at the end of the string as defined by its size? Conversely, if the string has a zero byte, does scanning stop at that byte, or continue to the full length of the string? I've looked at the Re2.h file and it does not seem to address this issue.

  • c++ strings do not use a null terminating byte, c-style strings do. See: http://stackoverflow.com/questions/11752705/does-string-contain-null-terminator – EdChum Jun 09 '14 at 07:29
  • Please show some sample source code addressing your issue – vz0 Jun 09 '14 at 07:31
  • @EdChum: C++11 strings *are* required to contain a \0. http://en.cppreference.com/w/cpp/string/basic_string/data or the standard itself says so too – deviantfan Jun 09 '14 at 08:09
  • @deviantfan this is new news to me, thanks for the update – EdChum Jun 09 '14 at 08:11
  • @deviantfan They're not text, but they are allowed by the UTF-8 encoding scheme. Unicode encoding schemes support all values in the ranges `[0...0xD800)` and `[0xE000...0x110000)`. Although not all code points in those ranges have been assigned, 0x0000 is. – James Kanze Jun 09 '14 at 08:17
  • @JamesKanze: Well, 0 of Ascii/Unicode/... will be 0 in UTF-8 too. But what else? And a std::string is normally meant to be something single-byte Ascii-like anyway, so I don't get the point (?) – deviantfan Jun 09 '14 at 08:21
  • @deviantfan The point is simple: `std::string` is code-agnostic, and can contain UTF-8. Functions on `std::string` do _not_ terminate at `'\0'`, but at the end of the string. And `0x00` is a valid UTF-8 character. – James Kanze Jun 09 '14 at 08:24
  • @deviantfan Also, of course... While many of the classical string functions do assume a single byte encoding (and a one to one mapping between upper and lower case, which is usually false), modern libraries, like `boost::regex` and `re2` explicitly support UTF-8 (in some cases, at least, with regards to Boost). You can create regular expressions which match the null character, and you can have null characters in the middle of a string. There is simply nothing special about the null character _except_ when constructing a string from a `char const*`. – James Kanze Jun 09 '14 at 08:38

2 Answers


A std::string containing UTF-8 characters can't have 0-bytes as part of the text
(only as termination), because UTF-8 doesn't allow 0's anywhere.

And given you're using something C++11-compliant, a terminating 0 is guaranteed
(it doesn't matter whether you use data() or c_str(), and data() is the original data, so...).
See http://en.cppreference.com/w/cpp/string/basic_string/data
or the standard (21.4.7.1/1 etc.).
=> The processing of a string will stop at the 0

deviantfan
  • Your first sentence is wrong, according to the Unicode standard. Unicode defines the code point 0x0000 as the control NULL, and the UTF-8 encoding format specifies how it is formatted in UTF-8 (as a single byte 0x00). – James Kanze Jun 09 '14 at 08:21
  • And of course, functions on `std::string` or in `` do _not_ stop at the 0. – James Kanze Jun 09 '14 at 08:22
  • As I said in the other comment, I don't understand the problem. If you're putting \s's in the string intentionally, it will be a problem. But why should anyone do that... – deviantfan Jun 09 '14 at 08:25
  • The problem is that your answer is incorrect. `'\0'` is a character like any other in `std::string`, and is a legal Unicode and UTF-8 character. He's asking about `Re2.h`, whose interface uses `std::string`. Which means that the processing won't (or shouldn't) stop at 0. – James Kanze Jun 09 '14 at 08:34

The interface to Re2 seems to use std::string, which almost certainly means that it uses the begin and the end of the string, and that null characters are characters like any other. (They are, after all, defined in Unicode and in UTF-8.) Of course, '\0' is in the category control characters, so it won't match something like "\pL" (which matches a letter). But it should match "\pC", as well as '\u0000' and other representations of the null character.

James Kanze