7

I'm building a text parser that uses std::string as the core storage for strings.

I know this is not optimal and that parsers inside compilers use optimized approaches for this. In my project I don't mind losing some performance in exchange for more clarity and easier maintenance.

At the beginning I read a huge text into memory and then scan each character to build an ordered set of tokens; it's a simple lexer. Currently I'm using std::string to represent the text of a token, but I would like to improve this a bit by using a reference/pointer into the original text.

From what I have read, it is bad practice to return and hold on to iterators, and it is also bad practice to refer to the std::string internal buffer.

Any suggestions on how to accomplish this in a "clean" way?

quamrana
  • 5
    Regarding iterators, it is not bad practice as long as you make sure they are not used after being invalidated – Marco A. Jul 15 '14 at 15:15
  • If you are keeping the whole file in memory anyway, why not just memmap it? Will likely be more efficient. – Deduplicator Jul 15 '14 at 15:19
  • 2
    I'm probably missing something obvious, but couldn't you just return an int representing the token's byte-offset from the beginning of the string? ints are pretty efficient... – Jeremy Friesner Jul 15 '14 at 15:22
  • 3
    Take a look at [`string_view`](http://stackoverflow.com/questions/20803826/what-is-string-view). – Casey Jul 15 '14 at 15:28
  • I had the wrong idea that I should not return iterators to callers outside the scope of the class that owns the string. In my case they will be valid until the end of the program. In this case I would say that an iterator is a nice replacement for an integer because of the abstraction. – Pedro Salgueiro Jul 15 '14 at 16:21

4 Answers

10

There are proposals to add string_view to C++ in an upcoming standard.

A string_view is a non-owning iterable range over characters with many of the utilities and properties you'd expect of a string class, except you cannot insert/delete characters (and editing characters is often blocked in some subtypes).

I would advise trying that approach -- write your own (in your own utility namespace). (You should have your own utility namespace for reusable code snippets anyhow).

The core data is a pair of char* or std::string::iterators (or const versions). If the user needs a null-terminated buffer, a to_string method allocates one. I would start with non-mutable (const) character data. Do not forget begin and end: that makes your view iterable with for(:) loops.
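
A minimal sketch of such a view, assuming C++11 and a pair of const char* as the core data. The namespace `util`, the class name and the `substr` helper are illustrative choices, not the interface of the proposed standard type:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

namespace util {  // hypothetical utility namespace

// Non-owning, read-only view over a character range [first, last).
// The viewed buffer must outlive every view that refers to it.
class string_view {
public:
    string_view(const char* first, const char* last)
        : first_(first), last_(last) { assert(first <= last); }

    // View over a whole std::string; s must outlive the view.
    explicit string_view(const std::string& s)
        : first_(s.data()), last_(s.data() + s.size()) {}

    // begin/end make the view usable in range-based for loops.
    const char* begin() const { return first_; }
    const char* end() const { return last_; }
    std::size_t size() const { return static_cast<std::size_t>(last_ - first_); }
    bool empty() const { return first_ == last_; }

    // Sub-view of at most count characters starting at pos; no allocation.
    string_view substr(std::size_t pos, std::size_t count) const {
        assert(pos <= size());
        if (count > size() - pos) count = size() - pos;
        return string_view(first_ + pos, first_ + pos + count);
    }

    // Allocates an owning, null-terminated copy of the viewed characters.
    std::string to_string() const { return std::string(first_, last_); }

private:
    const char* first_;
    const char* last_;
};

}  // namespace util
```

A lexer could then hand out `util::string_view` values for tokens and call `to_string()` only where an owning copy is really needed.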

This design has the danger that the original std::string has to persist long enough to outlast all of the views.

If you are willing to give up some performance for safety, have the view own a std::shared_ptr<const std::string> that it can move a std::string into, and as a first step move the entire buffer into it, and then start chopping/parsing it down. (child views make a new shared pointer to same data). Then your view class is more like a non-mutable string with shared storage.

The upsides to the shared_ptr<const> version include safety, longer lifetime of the views (there is no more lifetime dependency), and you can easily forward your const "substring" type methods to the std::string so you can write less code.
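
As a rough illustration of the shared-storage variant, here is a sketch under the assumption that the whole text is moved into a shared buffer once and every sub-view copies the shared pointer. The name `shared_text` and its members are hypothetical:

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>

namespace util {  // hypothetical utility namespace

// Read-only view that shares ownership of the underlying buffer,
// so the buffer stays alive as long as any view refers to it.
class shared_text {
public:
    // Moves the whole text into shared storage; the view covers all of it.
    explicit shared_text(std::string text)
        : owner_(std::make_shared<std::string>(std::move(text))),
          pos_(0), len_(owner_->size()) {}

    // Child view over the same storage; copies the shared_ptr, not the text.
    shared_text substr(std::size_t pos, std::size_t count) const {
        assert(pos <= len_);
        if (count > len_ - pos) count = len_ - pos;
        return shared_text(owner_, pos_ + pos, count);
    }

    const char* begin() const { return owner_->data() + pos_; }
    const char* end() const { return begin() + len_; }
    std::size_t size() const { return len_; }

    // Forwards to std::string::substr to produce an owning copy.
    std::string to_string() const { return owner_->substr(pos_, len_); }

private:
    shared_text(std::shared_ptr<const std::string> owner,
                std::size_t pos, std::size_t len)
        : owner_(std::move(owner)), pos_(pos), len_(len) {}

    std::shared_ptr<const std::string> owner_;  // declared first: len_ depends on it
    std::size_t pos_;
    std::size_t len_;
};

}  // namespace util
```

The cost mentioned above shows up in `substr`: each child view performs a reference-count update, which the non-owning version avoids.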

Downsides include possible incompatibility with the incoming standard one [1], and lower performance because you are dragging a shared_ptr around.

I suspect views and ranges are going to be increasingly important in modern C++ with the upcoming and recent improvements to the language.

boost::string_ref is apparently an implementation of a proposal to the C++1y standard.


[1] However, given how simple it is to add capabilities in template metaprogramming, having a "resource owner" template argument to a view type might be a good design decision. Then you can have owning and non-owning string_views with otherwise identical semantics...

Yakk - Adam Nevraumont
6

Some thoughts here:

-The string's internal representation lives exactly as long as the string itself. If you save pointers or iterators into the string to use later (e.g. printing reports, post-processing) beyond the scope of the string, you will hit invalid memory accesses. In this kind of processing the text normally lives for the whole run of the process.
-Iterators are a good choice. (For extreme performance and generality I suggest a raw pointer to const, const char*, because the source could be almost anything: a string, a buffer, a memory-mapped buffer, data read from a stream, etc.)
-A good practice is, instead of copying the tokens, to save a pair (token begin iterator, token end iterator) in a collection of tokens.
-For performance it is imperative to avoid making a lot of allocations (allocation is one of the most expensive operations in any language).
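
The points above can be sketched as a whitespace-splitting lexer that stores only pointer pairs. The `Token` type and `tokenize` function are hypothetical names, and the code assumes the source text outlives the tokens and is not modified while they are in use:

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token type: a half-open range [first, last) into the source text.
struct Token {
    const char* first;
    const char* last;

    std::size_t size() const {
        assert(first <= last);
        return static_cast<std::size_t>(last - first);
    }
    // Copies the token text; only call this when an owning string is needed.
    std::string text() const { return std::string(first, last); }
};

// Splits text on whitespace without copying any characters.
// The returned tokens point into text, so text must outlive them.
std::vector<Token> tokenize(const std::string& text) {
    std::vector<Token> tokens;
    const char* p = text.data();
    const char* const end = p + text.size();
    while (p != end) {
        // Skip leading whitespace.
        while (p != end && std::isspace(static_cast<unsigned char>(*p))) ++p;
        const char* start = p;
        // Consume one run of non-whitespace characters.
        while (p != end && !std::isspace(static_cast<unsigned char>(*p))) ++p;
        if (start != p) tokens.push_back(Token{start, p});
    }
    return tokens;
}
```

The only allocations here come from the growth of the token vector; no per-token string is created unless `text()` is called.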

You could check lexertl (for more ideas or to use it): http://www.benhanson.net/lexertl.html and Spirit (more complete): http://www.boost.org/doc/libs/release/libs/spirit/

NetVipeC
4

Returning and using iterators is not a bad practice, assuming of course that you are not modifying the input buffer, and it does not look like you are.

Wojtek Surowka
0

I may be considered a heathen here but as long as you work on a const reference to the actual string then I don't see any reason not to use const char* into the string's data (as long as you're using c++11).

According to the C++11 standard, the internal data of a std::string must be contiguous, and no pointers into it can be invalidated unless the string is operated on through a non-const reference.

21.4.1 basic_string general requirements

5 The char-like objects in a basic_string object shall be stored contiguously. That is, for any basic_string object s, the identity &*(s.begin() + n) == &*s.begin() + n shall hold for all values of n such that 0 <= n < s.size().

6 References, pointers, and iterators referring to the elements of a basic_string sequence may be invalidated by the following uses of that basic_string object:

as an argument to any standard library function taking a reference to non-const basic_string as an argument.

Calling non-const member functions, except operator[], at, front, back, begin, rbegin, end, and rend.

So rather than use s.data() use &*s.begin() to get at the actual internal buffer.

NOTE: I am pretty sure these guarantees do not hold for previous versions of the standard.
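
A small sketch of what the quoted contiguity guarantee allows, assuming C++11 or later. The helper names `is_contiguous_at` and `pointer_at` are made up for illustration:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Checks the identity quoted from 21.4.1/5 for one index n < s.size().
bool is_contiguous_at(const std::string& s, std::size_t n) {
    assert(n < s.size());
    return &*(s.begin() + n) == &*s.begin() + n;
}

// Returns a pointer to the character at offset n inside s.
// The pointer stays valid only while s is alive and no invalidating
// (non-const) operation is performed on it.
const char* pointer_at(const std::string& s, std::size_t n) {
    assert(n < s.size());  // &*s.begin() must not be evaluated on an empty string
    return &*s.begin() + n;
}
```

Both helpers require a non-empty string, because dereferencing `s.begin()` on an empty string is not valid.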

Galik
  • Why do you think `s.data()` does not inherit those guarantees? (with the added benefit of being defined behavior for `size()==0`) – Yakk - Adam Nevraumont Jul 15 '14 at 16:41
  • @Yakk In C++11 `s.data()` may well have the same guarantees. I am not familiar with it enough to be certain. But as I understand it, the previous version of the standard did not require the internal representation of a `std::string` to be contiguous or null terminated. But the string returned from `s.c_str()` was. That meant that an implementation could present you with a c-string friendly copy of the real data. And `s.data()` is a synonym of `s.c_str()`. – Galik Jul 15 '14 at 17:05
  • This only works if you decode the string into utf-16 or utf-32, otherwise the pointer arithmetic would not work (an increase of one in the pointer might not be an increase of one in the char index). – Pedro Salgueiro Jul 16 '14 at 08:17
  • @Pedro Using a `const char*` works just as well as accessing the `std::string` through an `std::string::iterator` or the `operator[]`. None of the normal string methods will work in unicode if the string is filled with unicode data. Even `s.size();` will not tell you how many characters are in the string. The libraries that are there to process unicode data will likely work perfectly well from a `const char*` or else they will provide their own iterators that *will* work. – Galik Jul 16 '14 at 09:19