53

I have a situation where I need to process large (many GB's) amounts of data as such:

  1. build a large string by appending many smaller (C char*) strings
  2. trim the string
  3. convert the string into a C++ const std::string for processing (read only)
  4. repeat

The data in each iteration are independent.

My question is, I'd like to minimise (if possible eliminate) heap allocated memory usage, as it at the moment is my largest performance problem.

Is there a way to convert a C string (char*) into a stl C++ string (std::string) without requiring std::string to internally alloc/copy the data?

Alternatively, could I use stringstreams or something similar to re-use a large buffer?

Edit: Thanks for the answers, for clarity, I think a revised question would be:

How can I build (via multiple appends) a stl C++ string efficiently. And if performing this action in a loop, where each loop is totally independant, how can I re-use thisallocated space.

Akusete
  • 10,704
  • 7
  • 57
  • 73

6 Answers6

30

You can't actually form a std::string without copying the data. A stringstream would probably reuse the memory from pass to pass (though I think the standard is silent on whether it actually has to), but it still wouldn't avoid the copying.

A common approach to this sort of problem is to write the code which processes the data in step 3 to use a begin/end iterator pair; then it can easily process either a std::string, a vector of chars, a pair of raw pointers, etc. Unlike passing it a container type like std::string, it would no longer know or care how the memory was allocated, since it would still belong to the caller. Carrying this idea to its logical conclusion is boost::range, which adds all the overloaded constructors to still let the caller just pass a string/vector/list/any sort of container with .begin() and .end(), or separate iterators.

Having written your processing code to work on an arbitrary iterator range, you could then even write a custom iterator (not as hard as it sounds, basically just an object with some standard typedefs, and operator ++/*/=/==/!= overloaded to get a forward-only iterator) that takes care of advancing to the next fragment each time it hit the end of the one it's working on, skipping over whitespace (I assume that's what you meant by trim). That you never had to assemble the whole string contiguously at all. Whether or not this would be a win depends on how many fragments/how large of fragments you have. This is essentially what the SGI rope mentioned by Martin York is: a string where append forms a linked list of fragments instead of a contiguous buffer, which is thus suitable for much longer values.


UPDATE (since I still see occasional upvotes on this answer):

C++17 introduces another choice: std::string_view, which replaced std::string in many function signatures, is a non-owning reference to a character data. It is implicitly convertible from std::string, but can also be explicitly constructed from contiguous data owned somewhere else, avoiding the unnecessary copying std::string imposes.

puetzk
  • 10,534
  • 3
  • 28
  • 32
  • I think your solution is the best approach (chaning the processing code) unfortunatly in this situation its not an option. – Akusete Dec 12 '08 at 02:54
  • 1
    Is there a standard specified way to achive buffer reuse? I just dont want to rely on the implementation on a specific platform. – Akusete Dec 12 '08 at 02:56
  • Unless that processing code is a library function which doesn't use iterators nor strings, just a plain old `char*` + size. – SasQ Jun 11 '15 at 06:48
20

Is it at all possible to use a C++ string in step 1? If you use string::reserve(size_t), you can allocate a large enough buffer to prevent multiple heap allocations while appending the smaller strings, and then you can just use that same C++ string throughout all of the remaining steps.

See this link for more information on the reserve function.

e.James
  • 116,942
  • 41
  • 177
  • 214
  • Your solution sounds like what im after, although, when you say "you can just use that same C++ string" do you mean using clear(), and then continuing to build the next string? – Akusete Dec 12 '08 at 02:47
  • This solution is the simplest for me to adopt, currently, although I'm unsure if the re-use is actually specified by the standard (although it does seem to work on my implementation) – Akusete Dec 12 '08 at 03:01
  • Using clear() should work. As far as I know, clear() does not free the memory used by the string, and so does not affect the space allocated by reserve(). – e.James Dec 12 '08 at 08:30
  • 1
    @Akusete @e.James: (Yes, I know it's 2.5 years later) As I read the relevant parts of the standard, clear() is defined in terms of erase(), and erase() specifically does _not_ dictate whether the block of memory in use after the call to erase() is the same as that in use _before_ the call to erase(), nor does it dictate that capacity() remain the same. See e.g. 21.3.3 and 21.3.5.5. Reasonable implementations certainly may function this way, but be sure to treat it only as a helpful optimization -- not a requisite for proper program functioning. – Nicholas Knight May 13 '11 at 15:08
  • @Nicholas Knight: Fair point. As always, the standard procedure is to make optimizations only after taking measurements and making sure that it is necessary. I wonder if I even knew that back in '08? `:)` – e.James May 16 '11 at 06:17
  • @jons34yp FYI, your edits seem to have attracted some attention. Nothing to worry about, but they are now discussed on meta http://meta.stackexchange.com/questions/194788/links-being-changed-to-cppreference-com – Bart Aug 27 '13 at 14:25
  • @Akusete The only way to change the string buffer size directly by a function member is std::string::shrink_to_fit. std::string::clear just set the size (not the buffer) of string to 0. – TheArchitect Jun 16 '17 at 21:08
  • std::string::clear implementations are free to do anything with the underlying buffer, including freeing it. – codesniffer Feb 05 '18 at 20:54
7

To help with really big strings SGI has the class Rope in its STL.
Non standard but may be usefull.

http://www.sgi.com/tech/stl/Rope.html

Apparently rope is in the next version of the standard :-)
Note the developer joke. A rope is a big string. (Ha Ha) :-)

Martin York
  • 257,169
  • 86
  • 333
  • 562
6

This is a lateral thinking answer, not directly addressing the question but "thinking" around it. Might be useful, might not...

Readonly processing of std::string doesn't really require a very complex subset of std::string's features. Is there a possibility that you could do search/replace on the code that performs all the processing on std::strings so it takes some other type instead? Start with a blank class:

class lightweight_string { };

Then replace all std::string references with lightweight_string. Perform a compilation to find out exactly what operations are needed on lightweight_string for it to act as a drop-in replacement. Then you can make your implementation work however you want.

Daniel Earwicker
  • 114,894
  • 38
  • 205
  • 284
2

Is each iteration independent enough that you can use the same std::string for each iteration? One would hope that your std::string implementation is smart enough to re-use memory if you assign a const char * to it when it was previously used for something else.

Assigning a char * into a std::string must always at least copy the data. Memory management is one of the main reasons to use std::string, so you won't be a able to override it.

David Norman
  • 19,396
  • 12
  • 64
  • 54
0

In this case, might it be better to process the char* directly, instead of assigning it to a std::string.

Alan
  • 45,915
  • 17
  • 113
  • 134
  • 3
    Yes it would, although the inputs (C char*'s) and output (std::string) are not under my control. – Akusete Dec 12 '08 at 02:35