2

Dealing with strings in C definitely makes one wish for a simple class-based language, but I'm trying to instead build a convenient string library. My idea is to use immutable strings, with a struct called Rstring (for "robust string") that has an internal const char* s and int length, such that operations like rstring_concat and rstring_substring return new Rstring objects with their own newly malloc'd character pointers.

Writing up an initial draft of the library, I am pleased with the simplicity and cleanliness of using my library instead of char *. However, I realized that returning a newly allocated pointer is somewhat of a PITA without a destructor. Each time some operation is done, say via concatenation or substrings, you have some newly allocated memory, and then whichever strings you had before are now hanging around, and probably are useless, so they need to be free'd, and C has no destructors, so the user is stuck having to manually go around freeing everything.

Therefore my question is, are there any clever ways to avoid having to make a whole bunch of manual calls to free? Possibly, for example, having internal start and end indices in order to have strings which act like small strings, but really contain quite a bit more? I don't know if there's any generally accepted method of doing this, or if people are simply stuck with tedious memory management in C.

Perhaps best, is there any widely-used library for convenient string manipulation in C?

Jonathan Leffler
  • 730,956
  • 141
  • 904
  • 1,278
limp_chimp
  • 13,475
  • 17
  • 66
  • 105
  • 1
    Presumably, you have an Rstring destructor function: `rstring_destroy(Rstring *old_one)`. The users would not be calling `free()`; they'd be calling `rstring_destroy()`. But that is mainly nit-picking for the sake of nit-picking. – Jonathan Leffler Feb 03 '13 at 02:02
  • Why are you reinventing the wheel? If you use C++, you have `std::string`, `std::wstring`, `std::u16string`, etc. If you want to stick to C, `CFString` (from Apple's Core Foundation) does what you want and is even portable to non-Apple systems via `CFLite`. Is this a learning exercise or something intended for production code? – Jonathan Grynspan Feb 03 '13 at 02:06
  • @JonathanGrynspan it's sort of a learning exercise, but also it's sometimes easier to code up my own library than try to figure out one written by someone else. You're definitely correct that C++ would be easier in this case, but like I said, I'm trying to stick with C. Though, my resolve is waning. – limp_chimp Feb 03 '13 at 02:13
  • why not leave memory management to the caller? – Josh Petitt Feb 03 '13 at 03:31

5 Answers5

3

If you need a better string library for C I would recommend The Better String Library.


C does not have any way of simplifying the memory management. Any memory you allocate using malloc must be freed. If you are working with a lot of strings in one function you could use a special registry where you register strings to. The registry could then destroy all the strings that were registered to it.

e.g. (only the interfaces, no implementation)

void rstring_reg_init(rstring_reg*);
void rstring_reg_destroy(rstring_reg*);
rstring rstring_reg_create(rstring_reg*, const char*);
void rstring_reg_register(rstring_reg*, rstring);
void rstring_reg_detach(rstring_reg*, rstring);

If your strings are mutable you could even create the strings using the registry (I'd rather call it pool then). If the strings were to remember their pool you could even let them register theirselves at creation time. This could lead to rather "beautiful code" like:

rstring f() {
    rstring_reg reg;
    rstring_reg_init(&reg);
    rstring a = rstring_reg_create(reg, "foo");
    rstring b = rstring_reg_create(reg, "bar");
    rstring ab = rstring_concat(a, b);
    rstring s = rstring_substr(ab, 1, 4);
    rstring_detach(s);
    rstring_reg_destroy(&reg);
    return s;
}

What this code would do is this:

  • Create registry
  • Create a and b strings which both know the registry
  • Create a new ab string. It is automatically added to the registry.
  • Create a new s string. It is also added to the registry.
  • Detach s from the registry as we want to return it.
  • Destroy registry. This automatically destroys a, b and ab
  • Return s - The caller of f is now responsible to manage its memory

In the end I'd rather recommend using C++ than using such beast.

What you really want is RAII and this is only possible using C++ or a proprietary GCC extension.

bikeshedder
  • 7,337
  • 1
  • 23
  • 29
0

The "slightly clever" way of declaring an immutable string struct would be something like:

struct Rstring {
    size_t length;
    char s[0];
};

It's the zero-length array hack in action. You're able to allocate Rstring objects as below:

struct Rstring* alloc_rstring(const char* text) {
    size_t len = strlen(text);
    struct Rstring* ret = malloc(sizeof(Rstring) + len + 1));
    ret->length = len;
    memcpy(ret->s, text, len + 1);
    return ret;
}

and free such Rstring objects with just a simple free(), since the string data resides on the same allocation.

Community
  • 1
  • 1
sheu
  • 5,653
  • 17
  • 32
  • The question is how to avoid a lot of manual calls to `free()` and in a different sort of way, automatically collect garbage in C – Aniket Inge Feb 03 '13 at 02:43
  • @Aniket: I realize what you're trying to say, but it doesn't apply in this circumstance. The poster is asking how to avoid having to free the `Rstring`'s internal memory buffers (the `s` member inside), since that is the part that requires explicit calls to `free()` -- essentially, you need two calls to `free()` to destroy one `Rstring`, and that ain't pretty. With the struct hack, you only require one call -- this is isomorphic to the C++ situation of having to make one call to a destructor. – sheu Feb 03 '13 at 02:46
  • @Aniket: you haven't specified *what* to work around, or what the work-around entail. – sheu Feb 03 '13 at 02:50
  • @sheu I don't think it makes a difference. Assuming `Rstrings` are declared as stack variables, not pointers, the `Rstring` structs themselves would be automatically deallocated. Either way, you'd need a call to `free()` for each `Rstring` (or alternatively, write an `rstring_delete` function, which could also delete an `Rstring*` if you wanted to use pointers... – limp_chimp Feb 03 '13 at 02:50
  • @Aniket: your "clever workaround" entails linking in an entire GC library. Ummm... yeah. – sheu Feb 03 '13 at 02:52
  • @sheu, the zen masters told me: "Solve the problem, not part of the problem" – Aniket Inge Feb 03 '13 at 02:53
  • @limp_chimp: it's the heap-allocated case when leaks are a problem. Supplying an `rstring_delete` also works (actually better), but it requires users to remember to `rstring_delete` all their `Rstrings`. – sheu Feb 03 '13 at 02:53
  • @Aniket: the problem with that is that "solving all the problem" is a recursion that does not terminate. At some point you're going to have to decide what subset of the problem you want to solve. Turning C/C++ into Java? Not so much. – sheu Feb 03 '13 at 02:55
  • You solve all the problems by working on a set of problems first and then going up until you've solved all. But the goal is to solve the problem and not part of it(shouldn't leave any stone unturned). Why can't a programmer, if available, use a library that does GC for him in any language possible? Its not that its going to consume a whole lot in the application - – Aniket Inge Feb 03 '13 at 02:57
  • @Aniket: there's a whole lot that can go wrong when you do GC. Off the top of my head -- You lose predictability in execution time (when the GC decides to stop the world and do a pass), you can't explicitly manage and free memory (possibly bloating your working set), you lose memory tracking that you could be using to profile performance, etc. It's no panacea. – sheu Feb 03 '13 at 03:00
  • And how does any of that matter to the `rstring` library? – Aniket Inge Feb 03 '13 at 03:04
  • @Aniket: it's a heavyweight solution that causes more problems for the rest of the program, when a simple solution for `Rstring` would suffice. That's the relevance. – sheu Feb 03 '13 at 03:05
0

What you really need is a garbage collector for C .

LISP and Java programmers take garbage collection for granted. With the Boehm-Demers-Weiser library, you easily can use it in C and C++ projects, too.

Aniket Inge
  • 25,375
  • 5
  • 50
  • 78
0

The whole point of immutable strings, at least as I understand it, is to be able to avoid copies by sharing storage. (You can also avoid some locking issues if that's important.) But in that case, you really can't allow the client to free a string, unless you force them to maintain reference counts, which is really a pain.

In many applications, though, you can do what the Apache Runtime (APR) does, and use memory pools. Objects created in a memory pool are not freed individually; the entire pool is freed. This requires some lifetime analysis, and it can lead to obscure bugs because the compiler doesn't do the bookkeeping for you, but it's generally less work than reference counts, and the other payoff is that both allocation and deallocation are really fast. This might counterbalance the other disadvantage, which is that releasing storage is somewhat imprecise.

The memory pool architecture works best if you've got some sort of request-based control flow, typical of servers (but also of certain kinds of UI applications). In the request-based structure, each request initializes a memory pool, and frees it when request processing finishes. If the request makes some kind of permanent change to the state of the server, you may have to move some data from the temporary pool to a (more) permanent one, but such changes are relatively rare. For requests which can be gracefully divided into stages, some of which are temporary-memory-hungry, it's possible to nest memory pools; here, again, you may need to explicitly flag certain strings for an enclosing memory pool, or move them before their memory pool is deleted.

Since all objects in a memory pool are deleted at the same time, it is possible to implement finalizers by attaching the finalizer to the pool itself. The pool can keep a simple linked list of finalizers, which are executed sequentially (in reverse creation order) before the pool is actually freed. This allows finalizers to refer to arbitrary other objects in the same pool, provided that they don't invalidate the state of objects; the restriction is not too draconian since by and large finalizers are about managing non-memory resources like file descriptors.

APR does not have immutable strings (or, at least, it didn't last time I looked at it), so Apache ends up doing a lot of unnecessary string copying. That's a design choice; I'm not voting one way or the other on it here.

rici
  • 234,347
  • 28
  • 237
  • 341
  • Another way to use immutable strings is to have all access be done through handles (e.g. with a method that takes a handle, a `const char *dat` and an `int length` and makes stores a reference to the new string into the handle, or which accepts a handle, a `char *dest`, an offset, and a request length, and copies the appropriate number of bytes to the destination buffer). Other methods could copy a reference from one handle to another, concatenate or clip strings, etc. Any code which receives a new handle must delete it, but there's no need to worry about whether other references identify... – supercat Mar 30 '15 at 21:20
  • ...the same string as the handle used to. – supercat Mar 30 '15 at 21:20
0

You may like to look at Glib, which has many useful data structures and services, which include type Gstring. There is a printf that puts its result directly into a Gstring. Very handy.

At heart your question is more about memory allocation than about strings per se. You get little help from C for memory allocation of strings or anything else. There are a few general approaches:

  1. Manual: Call malloc() and free() for every object.
  2. Conservative garbage collection: The Boehm collector is very mature. Though it has some problems, they're well understood at this point. Certain quirky coding methods and aggressive optimizations must be avoided.
  3. Reference counting garbage collection: This requires that you to call reference and dereference routines that increment/decrement a counter in the string at the creation and destruction of each pointer to it. When a decrement takes the count to zero, the string is freed. This is just as tedious and error prone as malloc() and free(), but it lets you handle complicated cases where the string's lifetime may end in any of several separate chunks of code.
  4. Pool allocation. In this scheme, you create any number of allocation pools. When creating a string, you specify the pool where it should be allocated. Pools are freed all at once.
  5. Stack allocation. Strings are allocated from a single memory pool that has Mark and Release operations. Calling Release frees all strings back to the point where Mark was last called.

Combinations of all above are common. For example, GNU Obstacks combine 4 and 5.

Gene
  • 46,253
  • 4
  • 58
  • 96