String-handling practices in C

Question

I'm starting a new project in plain C (c99) that is going to work primarily with text. Because of external project constraints, this code has to be extremely simple and compact, consisting of a single source-code file without external dependencies or libraries except for libc and similar ubiquitous system libraries.

With that understanding, what are some best-practices, gotchas, tricks, or other techniques that can help make the string handling of the project more robust and secure?

Does using C in the first place count as a gotcha? `/* ouch, stop! just kidding! */` — , Jan 02 '11 at 20:33
@anatolyg: Why would anyone want to use wide characters? Uhg. — R.. GitHub STOP HELPING ICE, Jan 02 '11 at 21:02
This question is amazing … so many wrong answers based on misconceptions about apparently “safe” solution. Wow, C makes safe string handling *really hard*. — Konrad Rudolph, Jan 03 '11 at 08:57
@Konrad: The only thing that makes it hard is the enormous volume of misinformation and bad examples out there. — R.. GitHub STOP HELPING ICE, Jan 03 '11 at 14:02
@R..: I think what Konrad is saying is that practices for "safe" string handling are not obvious in C. Sure, anything's easy if you know how to do it. But C is an "enough rope" language with little direction and lots of bad precedents. That makes doing things "right" hard. — tylerl, Jan 03 '11 at 22:54
@tylerl: I agree completely. The amount of "bad precedent" code out there is probably the single worst thing about the C language. — R.. GitHub STOP HELPING ICE, Jan 05 '11 at 20:34
I would like to state here that I put a bounty on this excellent question. The question was closed two days before the bounty's term was up. I think that this is a very important question and very difficult to find accurate information for. — dotancohen, Aug 21 '12 at 14:56
@dotancohen ironic that the question stood for almost 2 years. Still, at least closed questions don't get deleted. — tylerl, Aug 21 '12 at 17:52

score 33 · Accepted Answer · answered Jan 02 '11 at 21:01

Without any additional information about what your code is doing, I would recommend designing all your interfaces like this:

size_t foobar(char *dest, size_t buf_size, /* operands here */)

with semantics like snprintf:

dest points to a buffer of size at least buf_size.
If buf_size is zero, null/invalid pointers are acceptable for dest and nothing will be written.
If buf_size is non-zero, dest is always null-terminated.
Each function foobar returns the length of the full non-truncated output; the output has been truncated if buf_size is less than or equal to the return value.

This way, when the caller can easily know the destination buffer size that's required, a sufficiently large buffer can be obtained in advance. If the caller cannot easily know, it can call the function once with either a zero argument for buf_size, or with a buffer that's "probably big enough" and only retry if you ran out of space.

You can also make a wrapped version of such calls analogous to the GNU asprintf function, but if you want your code to be as flexible as possible I would avoid doing any allocation in the actual string functions. Handling the possibility of failure is always easier at the caller level, and many callers can ensure that failure is never a possibility by using a local buffer or a buffer that was obtained much earlier in the program so that the success or failure of a larger operation is atomic (which greatly simplifies error handling).

score 10 · Answer 2 · answered Jan 03 '11 at 14:44

Some thoughts from a long-time embedded developer, most of which elaborate on your requirement for simplicity and are not C-specific:

Decide which string-handling functions you'll need, and keep that set as small as possible to minimize the points of failure.
Follow R.'s suggestion to define a clear interface that is consistent across all string handlers. A strict, small-but-detailed set of rules allows you to use pattern-matching as a debugging tool: you can be suspicious of any code that looks different from the rest.
As Bart van Ingen Schenau noted, track the buffer length independently of the string length. If you'll always be working with text it's safe to use the standard null character to indicate end-of-string, but it's up to you to ensure the text+null will fit in the buffer.
Ensure consistent behavior across all string handlers, particularly where the standard functions are lacking: truncation, null inputs, null-termination, padding, etc.
If you absolutely need to violate any of your rules, create a separate function for that purpose and name it appropriately. In other words, give each function a single unambiguous behavior. So you might use str_copy_and_pad() for a function that always pads its target with nulls.
Wherever possible, use safe built-in functions (e.g. memmove() per Jonathan Leffler) to do the heavy lifting. But test them to be sure they're doing what you think they're doing!
Check for errors as soon as possible. Undetected buffer overruns can lead to "ricochet" errors that are notoriously difficult to locate.
Write tests for every function to ensure it satisfies its contract. Be sure to cover the edge cases (off by 1, null/empty strings, source/destination overlap, etc.) And this may sound obvious, but be sure you understand how to create and detect a buffer underrun/overrun, then write tests that explicitly generate and check for those problems. (My QA folks are probably sick of hearing my instructions to "don't just test to make sure it works; test to make sure it doesn't break.")

Here are some techniques that have worked for me:

Create wrappers for your memory-management routines that allocate "fence bytes" on either end of your buffers during allocation and check them upon deallocation. You can also verify them within your string handlers, perhaps when a STR_DEBUG macro is set. Caveat: you'll need to test your diagnostics thoroughly, lest they create additional points of failure.
Create a data structure that encapsulates both the buffer and its length. (It can also contain the fence bytes if you use them.) Caveat: you now have a non-standard data structure that your entire code base must manage, which may mean a substantial re-write (and therefore additional points of failure).
Make your string handlers validate their inputs. If a function forbids null pointers, check for them explicitly. If it requires a valid string (like strlen() should) and you know the buffer length, check that the buffer contains a null character. In other words, verify any assumptions you might be making about the code or data.
Write your tests first. That will help you understand each function's contract--exactly what it expects from the caller, and what the caller should expect from it. You'll find yourself thinking about the ways you'll use it, the ways it might break, and about the edge cases it must handle.

Thanks so much for asking this question! I wish more developers would think about these issues--especially before they start coding. Good luck, and best wishes for a robust, successful product!

+1 for the first half and the last paragraph. They're good enough that I'm withholding judgement on the second half. :-) Testing for null pointers is one of my Considered Harmfuls and while fancy structures can help you debug, they also make your code a lot more of a pain to use and integrate with other code. I would instead work on rigorously testing your functions to satisfy their contracts, and then you don't need any further runtime checking mess. — R.. GitHub STOP HELPING ICE, Jan 03 '11 at 17:17
@R.: I agree that thorough tests are key, and that "fancy structures" often do more harm than good. But simple ones, used correctly, can also increase robustness and testability, particularly if they're an integral part of the design. Can you explain your dislike of null-pointer tests? I'm a fan of testing everything I can, but I'll often use macros (e.g. #ifdef TEST) to surround tests that may be expensive or redundant. Thanks for your thoughts! — Adam Liss, Jan 03 '11 at 17:39
My part in a long discussion on the merits or demerits of testing for NULL pointers is here: http://stackoverflow.com/questions/4390007/in-either-c-or-c-should-i-check-pointer-parameters-for-null/4390210#4390210 — R.. GitHub STOP HELPING ICE, Jan 03 '11 at 23:02
Thanks for the (non-NULL) reference! Lots of good ideas there. I think tylerl's situation is unusual in that he's starting from scratch and therefore controls the complete design. So, here, it's less a question of "playing nice with others" and all about defining a coherent set of strict rules and enforcing them everywhere. At least until the code is mature, I'd rather err on the side of over-validating than miss subtle bugs that propagate through the developing system. — Adam Liss, Jan 03 '11 at 23:24

score 7 · Answer 3 · answered Jan 02 '11 at 20:43

7

Have a look at strlcpy and strlcat , see the original paper for details.

answered Jan 02 '11 at 20:43

ismail

46,010
9
86
95

5

Concatenation is almost always a bad idiom (in terms of safety, performance, and complexity). – R.. GitHub STOP HELPING ICE Jan 02 '11 at 20:52

score 2 · Answer 4 · answered Jan 02 '11 at 20:26

2

Two cents:

Always use the "n" version of the string functions: strncpy, strncmp, (or wcsncpy, wcsncmp etc.)
Always allocate using the +1 idiom: e.g. char* str[MAX_STR_SIZE+1], and then pass MAX_STR_SIZE as the size for the "n" version of the string functions and finish with str[MAX_STR_SIZE] = '\0'; to make sure all strings are properly finalized.

The final step is important since the "n" version of the string functions won't append '\0' after copying if the maximum size was reached.

answered Jan 02 '11 at 20:26

hillel

2,343
2
18
25

7

-1 s/always/never/ for `strncpy`. This function does not do what people think it does, and it's almost never useful. – R.. GitHub STOP HELPING ICE Jan 02 '11 at 20:51
@R..: What is it exactly that you think its doing? Can you give a concrete example? – hillel Jan 02 '11 at 20:57
Luckily we can take a look in the library source to see what it is actually doing: http://git.uclibc.org/uClibc/tree/libc/string/strncpy.c. Like all of the rest of C, it seems fine to me provided you understand that your string MUST terminate with zero and you follow hillel's advice. +1. – Jan 02 '11 at 21:05
6

`strncpy` does not null-terminate, and does null-pad the whole remainder of the destination buffer, which is a huge waste of time. You can work around the former by adding your own null padding, but not the latter. It was never intended for use as a "safe string handling" function, but for working with fixed-size fields in Unix directory tables and database files. `snprintf(dest, n, "%s", src)` is the only correct "safe strcpy" in standard C, but it's likely to be a lot slower. – R.. GitHub STOP HELPING ICE Jan 02 '11 at 21:14
2

By the way, truncation in itself can be a major bug and in some cases might lead to privilege elevation or DoS, so throwing "safe" string functions that truncate their output at a problem is not a way to make it "safe" or "secure". Instead, you should ensure that the destination buffer is the right size and simply use `strcpy` (or better yet, `memcpy` if you already know the source string length). – R.. GitHub STOP HELPING ICE Jan 02 '11 at 21:15
6

Note that `strncat()` is even more confusing in its interface than `strncpy()` - what exactly is that length argument, again? It isn't what you'd expect based on what you supply `strncpy()` etc - so it is more error prone even than `strncpy()`. For copying strings around, I'm increasingly of the opinion that there is a strong argument that you only need `memmove()` because you always know all the sizes ahead of time and make sure there's enough space ahead of time. Use `memmove()` in preference to any of `strcpy()`, `strcat()`, `strncpy()`, `strncat()`, `memcpy()`. – Jonathan Leffler Jan 02 '11 at 21:24
Well, I didn't know all that so I've learnt a hell of a lot tonight. Thanks for the info. – Jan 02 '11 at 21:28
@R: I agree that the API for strnxxx has its gotchas, but I don't agree that it's a reason to avoid it completely. Especially when the alternative you suggest is not standard and will require re-implementing the string functions or using non-portable extensions (that has its gotchas, too.) You make some valid points about performance and truncation issues, but these may be acceptable in many cases. I will up-vote you though, as your approach (done right), does seem superior in terms of security & performance. Also note that MS has the non-portable strcpy_s function family for this purpose. – hillel Jan 02 '11 at 21:55
I don't think the gotchas alone are the reason not to use the `strn*` functions. They're only half of the story. The other half is that these functions were designed for a completely different purpose, which just happens to have behavior *similar* to the desired behavior. This is something like using UTF-8 encoding/decoding functions as a variable-length-encoding for non-character integer data. It might work, but it's abusing coincidental similarity between behavior and probably has ugly corner cases you don't want to deal with. – R.. GitHub STOP HELPING ICE Jan 02 '11 at 22:49

score 0 · Answer 5 · answered Aug 17 '12 at 11:00

0

When it comes to time vs space, don't forget to pick the standard bit twiddling from here

During my early firmware projects, I used the look up tables to count the bit set in a O(1) operation efficiency.

answered Aug 17 '12 at 11:00

tsenapathy

5,646
3
26
26

score 0 · Answer 6 · edited Jan 03 '11 at 08:56

0

Work with arrays on the stack whenever this is possible and initialize them properly. You don't have to keep track of allocations, sizes and initializations.
```
char myCopy[] = { "the interesting string" };
```
For medium sized strings C99 has VLA. They are a bit less usable since you can't initialize them. But you still have the first two of the above advantages.
```
char myBuffer[n];
myBuffer[0] = '\0';
```

edited Jan 03 '11 at 08:56

Konrad Rudolph

530,221
131
937
1,214

answered Jan 02 '11 at 22:33

Jens Gustedt

76,821
6
102
177

6

VLA is a **stack overflow** waiting to happen. (The bad kind of SO, not the good kind that does your homework for you. :-) – R.. GitHub STOP HELPING ICE Jan 02 '11 at 23:09
@R, yes, if one misuses the feature. But there are appropriate uses, with the necessary checks. After all, you wouldn't do a `malloc` from unchecked user input either. – Matthew Flaschen Jan 03 '11 at 01:32
@Matthew: The only good use for VLA I've ever found required some serious invention, and it's still only a theoretical class of problems: recursive problems where the total memory usage across all levels of recursion is known to be `O(n)` but where any one level could use up to `C*n` space (obviously only a finite number of levels could actually use this much). VLA is the only tool that makes it possible without `malloc`, assuming `O(n^2)` does not fit on the stack. But for every real-world case I've ever seen, a moderately large constant dimension is just as good as VLA. – R.. GitHub STOP HELPING ICE Jan 03 '11 at 14:07
@R.. there are cases for VLA which are not too bad, in particular pointers to VLA can even be used with `malloc` if you like. By that you can have some sort of very basic dynamic types. But sure that is off topic, here. For the use of them on the stack I'd stick to that "medium sized" that I mentioned in the answer. – Jens Gustedt Jan 03 '11 at 16:08

score 0 · Answer 7 · answered Jan 03 '11 at 12:56

Some important gotchas are:

In C, there is no relation at all between string length and buffer size. A string always runs up to (and including) the first '\0'-character. It is your responsibility as a programmer to make sure this character can be found within the reserved buffer for that string.
Always explicitly keep track of buffer sizes. The compiler keeps track of array sizes, but that information will be lost to you before you know it.

String-handling practices in C

7 Answers7

Linked