40

I recently got inspired to start a project I've been wanting to code for a while. I want to do it in C, because memory handling is key to this application. I was searching for a good implementation of strings in C, since I know that doing it myself could lead to some messy buffer overflows, and I expect to be dealing with a fairly large number of strings.

I found this article which gives details on each, but they each seem to have a good number of drawbacks (don't get me wrong, the article is EXTREMELY helpful, but it still worries me that even if I were to choose one of those, I wouldn't be using the best I can get). I also don't know how up to date the article is, hence my current plea.

What I'm looking for is something that can hold a large number of characters and simplifies searching through the string. If it allows me to tokenize the string in any way, even better. It should also have good I/O performance. Printing and formatted printing aren't a top priority. I know I shouldn't expect a library to do all the work for me, but I was wondering if there is a well-documented string library out there that could save me some time and work.

Any help is greatly appreciated. Thanks in advance!

EDIT: I was asked about the license I prefer. Any sort of open source license will do, but preferably GPL (v2 or v3).

EDIT2: I found the Better String (bstring) library and it looks pretty good: good documentation, a small yet versatile set of functions, and easy to mix with C strings. Does anyone have any good or bad stories about it? The only downside I've read about is that it lacks Unicode support (again, I've only read about this, I haven't seen it first-hand yet), but everything else seems pretty good.

EDIT3: Also, preferably it should be pure C.

chamakits
  • 1
    I'm writing a (hobby) framework that includes a string type, does that count? – Chris Lutz Jan 14 '11 at 04:52
  • 2
    You should mention what kind of license you do or don't want as well, since some of the best contenders are GPL. – detly Jan 14 '11 at 05:30
  • @Chris you can plug it if you want :P I may take a look at it, but if it's still young I probably won't use it for my project. Nothing personal, it's just that C strings are known to be tricky, and until it's been thoroughly tested (which I can help with :P), I wouldn't feel comfortable using it in my code base. – chamakits Jan 14 '11 at 07:26
  • 4
    I would suggest C++ but for some reason you wish to make life hard for yourself. – David Heffernan Jan 14 '11 at 07:36

6 Answers

44

It's an old question, and I hope you have already found a useful library. In case you haven't, please check out the Simple Dynamic Strings (SDS) library on GitHub. Here is the author's description:

SDS is a string library for C designed to augment the limited libc string handling functionalities by adding heap allocated strings that are:

  • Simpler to use.
  • Binary safe.
  • Computationally more efficient.
  • But yet... Compatible with normal C string functions.

This is achieved using an alternative design in which instead of using a C structure to represent a string, we use a binary prefix that is stored before the actual pointer to the string that is returned by SDS to the user.

+--------+-------------------------------+-----------+
| Header | Binary safe C alike string... | Null term |
+--------+-------------------------------+-----------+
         |
         `-> Pointer returned to the user.

Because the metadata is stored as a prefix before the returned pointer, and because every SDS string implicitly adds a null terminator at the end regardless of its actual content, SDS strings work well together with C strings, and the user is free to pass them to any function that only reads the string.
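
A minimal sketch of the header-before-pointer layout described above, assuming a simplified two-field header. The names `pstr_hdr`, `pstr_new`, `pstr_len`, and `pstr_free` are hypothetical and are not SDS's API; SDS's actual header layout and function set differ.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical header that sits immediately before the character data. */
typedef struct {
    size_t len;  /* bytes in use, excluding the terminator */
    size_t cap;  /* bytes available for data, excluding the terminator */
} pstr_hdr;

/* Allocate header + data + terminator, return a pointer to the data. */
char *pstr_new(const void *data, size_t len) {
    assert(data != NULL || len == 0);
    pstr_hdr *h = malloc(sizeof *h + len + 1);
    if (h == NULL)
        return NULL;
    h->len = len;
    h->cap = len;
    char *s = (char *)(h + 1);       /* first byte after the header */
    if (len > 0)
        memcpy(s, data, len);        /* binary safe: may contain '\0' */
    s[len] = '\0';                   /* always terminated for C functions */
    return s;
}

/* O(1) length: step back from the data pointer to the header. */
size_t pstr_len(const char *s) {
    assert(s != NULL);
    const pstr_hdr *h = (const pstr_hdr *)(const void *)(s - sizeof(pstr_hdr));
    return h->len;
}

/* Free the whole allocation, which starts at the header. */
void pstr_free(char *s) {
    if (s != NULL)
        free(s - sizeof(pstr_hdr));
}
```

Because the returned pointer is an ordinary `char *`, it can be handed to read-only functions such as `printf("%s", s)` or `strstr`, while `pstr_len` still reports the full length even when the data contains embedded zero bytes.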

Steinway Wu
  • This is an extremely good find, and a good track record (used in redis). I'll definitely keep this in mind next time I'm looking to write some string heavy C code. – chamakits Aug 03 '14 at 04:58
  • I'll definitely try this one. –  Oct 05 '15 at 15:52
  • 2
    You might like to mention that it's used by Redis and that the author is the principal author of Redis? – Vérace Jan 30 '21 at 14:40
14

I would suggest not using any library aside from malloc, free, strlen, memcpy, and snprintf. These functions give you all of the tools for powerful, safe, and efficient string processing in C. Just stay away from strcpy, strcat, strncpy, and strncat, all of which tend to lead to inefficiency and exploitable bugs.
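
As a rough illustration of that style, here is a sketch of two helpers that compute lengths once and then use `memcpy` or `snprintf` with exact sizes. The helper names `concat_dup` and `format_kv` are hypothetical, made up for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a freshly allocated a + b, or NULL on failure.
   Each length is computed exactly once and reused for sizing and copying. */
char *concat_dup(const char *a, const char *b) {
    assert(a != NULL && b != NULL);
    size_t la = strlen(a);
    size_t lb = strlen(b);
    if (lb >= SIZE_MAX - la)          /* la + lb + 1 would overflow */
        return NULL;
    char *out = malloc(la + lb + 1);
    if (out == NULL)
        return NULL;
    memcpy(out, a, la);
    memcpy(out + la, b, lb + 1);      /* +1 copies b's terminator as well */
    return out;
}

/* Hypothetical helper: formats "key=value" into memory sized by a first
   snprintf pass, so the buffer can never be too small. */
char *format_kv(const char *key, int value) {
    assert(key != NULL);
    int n = snprintf(NULL, 0, "%s=%d", key, value);  /* measure only */
    if (n < 0)
        return NULL;
    char *out = malloc((size_t)n + 1);
    if (out == NULL)
        return NULL;
    snprintf(out, (size_t)n + 1, "%s=%d", key, value);
    return out;
}
```

The caller owns the returned memory and releases it with `free`.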

Since you mentioned searching, whatever choice of library you make, strchr and strstr are almost certainly going to be what you want to use. strspn and strcspn can also be useful.
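
A small sketch of searching and tokenizing with those standard functions, without modifying the input. The helper names `count_tokens` and `find_offset` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: counts the non-empty tokens in s that are separated
   by any character from delims. The input string is left untouched. */
size_t count_tokens(const char *s, const char *delims) {
    assert(s != NULL && delims != NULL);
    size_t count = 0;
    for (;;) {
        s += strspn(s, delims);       /* skip a run of delimiters */
        if (*s == '\0')
            break;
        s += strcspn(s, delims);      /* skip over the token itself */
        count++;
    }
    return count;
}

/* Hypothetical helper: byte offset of the first occurrence of needle in
   haystack, or -1 if it does not occur. */
long find_offset(const char *haystack, const char *needle) {
    assert(haystack != NULL && needle != NULL);
    const char *hit = strstr(haystack, needle);
    return hit != NULL ? (long)(hit - haystack) : -1;
}
```

The same `strspn`/`strcspn` pair yields each token's start and length, so tokens can be copied out with `memcpy` when they are needed as separate strings.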

R.. GitHub STOP HELPING ICE
  • Could you provide more details about strcpy, strcat ? I use them often and I never saw memcpy+strlen being faster. – Benoit Thiery Jan 14 '11 at 10:53
  • Hm... I'll take this into consideration. My biggest worry is that, quite frankly, it's been a while since I coded in C, and I've read about projects whose code became easily exploitable because of inexperience with C. True, I should just tackle this head on and get the experience, but I thought I'd first use something fully built, see how it works and what I would usually need, and then eventually jump into writing my own. Still, I'll keep this recommendation in mind. Thanks! – chamakits Jan 14 '11 at 11:06
  • 6
    @Benoit: It's basically impossible to use `strcat` correctly because to know whether the new part will fit in your buffer, you already need to know both string lengths and the buffer size. Not only does `strcat` wastefully recompute the first 2, but it also makes it easy for you to ignore the fact that you're not aware of whether they fit in the buffer. If you don't use `strcat`, you'll be certain to always have the right lengths on-hand. `strcpy` isn't as bad, but the same applies. Note that constructing a string piece-by-piece with `strcat` also happens to be `O(n^2)`. – R.. GitHub STOP HELPING ICE Jan 14 '11 at 17:20
  • 3
    You forgot `memmove`. That one's helpful. – Chris Lutz Jan 14 '11 at 22:47
  • 4
    In this case, I'll disagree with @R.. Postfix and Qmail have their own string handling routines, the GNU project has obstacks, and I've certainly rolled my own before. I'd favour a high-level library for dynamically-allocated strings over low-level `str*()` for anything moderately complex (even if that library consists of a single function for creating/appending to a string). – ninjalj Jul 20 '11 at 16:29
  • @ninjalj: Notice I did not advocate for using `str*()` functions. In fact I consider most of them harmful, and I consider the operation of "append" or "concatenate" to string fundamentally harmful. – R.. GitHub STOP HELPING ICE Jul 20 '11 at 18:03
  • There is very little that cannot be done with the standard functions provided in the C standard library, everything else can be covered with a couple of pointers and arithmetic. – David C. Rankin Aug 14 '16 at 08:06
  • Really strange answer. Instead of using some tested, efficient and well optimized library, you suggest using this archaic concept of null terminated strings. The fact `strlen` is O(n) is laughable, the fact it opens you up to UB is unacceptable. – Larry Teischwilly Oct 08 '22 at 10:22
4

If you really want to get it right from the beginning, you should look at ICU, i.e. Unicode support, unless you are sure your strings will never hold anything but plain ASCII-7... Searching, regular expressions, and tokenization are all in there.

Of course, going C++ would make things much easier, but even then my recommendation of ICU would stand.

DevSolar
  • Of course. You get to choose C, which is a canoe, or C++, which is a steamboat. Still waiting for someone to invent a decent boat somewhere between. – James M. Lay Mar 18 '21 at 18:33
3

I also found a need for an external C string library, as I find the <string.h> functions very inefficient, for example:

  • strcat() can be very expensive in performance, as it has to find the '\0' char each time you concatenate a string
  • strlen() is expensive, as again, it has to find the '\0' char instead of just reading a maintained length variable
  • The char array is of course not dynamic and can cause very dangerous bugs (a segmentation-fault crash is the best-case scenario when you overflow your buffer)

The solution should be a library that provides not only functions but also a struct that wraps the string and stores important fields such as its length and buffer size.
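
A minimal sketch of such a wrapper, assuming a doubling growth policy. The type `dstr` and its functions are hypothetical and do not correspond to any of the libraries listed below.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical string wrapper that carries its own length and capacity. */
typedef struct {
    char  *data;  /* always '\0'-terminated */
    size_t len;   /* bytes in use, excluding the terminator */
    size_t cap;   /* allocated bytes, including room for the terminator */
} dstr;

int dstr_init(dstr *s) {
    assert(s != NULL);
    s->data = malloc(16);
    if (s->data == NULL)
        return -1;
    s->data[0] = '\0';
    s->len = 0;
    s->cap = 16;
    return 0;
}

/* Append n bytes from src (which must not point into s->data).
   No scan for '\0' is needed because the length is stored. */
int dstr_append(dstr *s, const char *src, size_t n) {
    assert(s != NULL && s->data != NULL && src != NULL);
    if (n >= SIZE_MAX - s->len)           /* len + n + 1 would overflow */
        return -1;
    size_t need = s->len + n + 1;
    if (need > s->cap) {
        size_t ncap = s->cap;
        while (ncap < need) {             /* double until it fits */
            if (ncap > SIZE_MAX / 2) { ncap = need; break; }
            ncap *= 2;
        }
        char *p = realloc(s->data, ncap);
        if (p == NULL)
            return -1;                    /* original buffer is still valid */
        s->data = p;
        s->cap = ncap;
    }
    memcpy(s->data + s->len, src, n);
    s->len += n;
    s->data[s->len] = '\0';
    return 0;
}

void dstr_free(dstr *s) {
    assert(s != NULL);
    free(s->data);
    s->data = NULL;
    s->len = 0;
    s->cap = 0;
}
```

With the length stored, getting the size is a field read rather than a scan, and doubling the capacity keeps repeated appends at amortized constant cost per byte.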

I looked for such libraries over the web and found the following:

  1. GLib String library (probably the most standard solution) - https://developer.gnome.org/glib/stable/glib-Strings.html
  2. http://locklessinc.com/articles/dynamic_cstrings/
  3. http://bstring.sourceforge.net/

Enjoy

SomethingSomething
2

Please check out milkstrings.
Sample code:

#include <stdio.h>   /* printf */
#include <stdlib.h>  /* atoi */
/* plus the milkstrings header, which declares tXt and the txt* functions */

int main(int argc, char * argv[]) {
  tXt s = "123,456,789" ;
  s = txtReplace(s,"123","321") ; // replace 123 by 321
  int num = atoi(txtEat(&s,',')) ; // pick the first number
  printf("num = %d s = %s \n",num,s) ;
  s = txtPrintf("%s,%d",s,num) ; // printf in new string
  printf("num = %d s = %s \n",num,s) ;
  s = txtConcat(s,"<-->",txtFlip(s),NULL) ; // concatenate some strings
  num = txtPos(s,"987") ; // find position of substring
  printf("num = %d s = %s \n",num,s) ;
  if (txtAnyError()) { // check for errors
    printf("%s\n",txtLastError()) ;
    return 1 ;
  }
  return 0 ;
}
archimedes
  • Any library that includes `while (!feof(fi))` is suspect to begin with. See: [**Why is “while ( !feof (file) )” always wrong?**](http://stackoverflow.com/questions/5431941/why-is-while-feof-file-always-wrong?s=1|2.6948) – David C. Rankin Aug 14 '16 at 08:10
  • 1
    while (!feof(fi)) is not in the library. It was part of some sample code. – archimedes Aug 15 '16 at 11:35
  • In the single file `milkstrings.c`, `while (!feof (fi)) {` appears at lines `306` and `371`. While true, the instances are wrapped in the `txtSKIPEXAMP` label, both stick out like a sore thumb that has been hit with a sledge hammer. e.g. `grep -nhA10 'feof\|^#if.*SKIP' milkstrings.c` I'm not your downvote, I'm just pointing out that whether conditionally compiled or not, using the `while (!feof (fi)) {` in this manner is a known bad idea that can lead directly to undefined behavior. – David C. Rankin Aug 15 '16 at 13:09
1

I faced this problem recently: I needed to append to a string holding millions of characters. I ended up writing my own.

It is simply a C array of characters, encapsulated in a class that keeps track of array size and number of allocated bytes.

In the debug-build benchmark below, it is roughly 9 to 19 times faster than SDS and std::string. The code is at

https://github.com/pedro-vicente/table-string

Benchmarks

For Visual Studio 2015, x86 debug build:

| API                   | Seconds |
| --------------------- | ------- |
| SDS                   | 19      |
| std::string           | 11      |
| std::string (reserve) | 9       |
| table_str_t           | 1       |

clock_gettime_t timer;
const size_t nbr = 1000 * 1000 * 10;
const char* s = "bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb";
size_t len = strlen(s);
timer.start();
table_str_t table(nbr *len);
for (size_t idx = 0; idx < nbr; ++idx)
{
  table.add(s, len);
}
timer.now("end table");
timer.stop();

EDIT: Maximum performance is achieved by allocating the whole string up front (the constructor's size parameter). If only a fraction of the total size is pre-allocated, performance drops. Example with 100 allocations:

std::string benchmark append string of size 33, 10000000 times
end str:        11.0 seconds    11.0 total
std::string reserve benchmark append string of size 33, 10000000 times
end str reserve:        10.0 seconds    10.0 total
table string benchmark with pre-allocation of 330000000 elements
end table:      1.0 seconds     1.0 total
table string benchmark with pre-allocation of ONLY 3300000 elements, allocation is MADE 100 times...patience...
end table:      9.0 seconds     9.0 total
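
To make the pre-allocation effect concrete, here is a sketch in plain C of an append buffer that counts how often it has to grow. This is not the code from the repository above; the type `abuf`, its `grows` counter, and both functions are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical append buffer; start from a zeroed struct: abuf b = {0}; */
typedef struct {
    char    *data;
    size_t   len;    /* bytes in use, excluding the terminator */
    size_t   cap;    /* allocated bytes, including the terminator */
    unsigned grows;  /* number of realloc calls made so far */
} abuf;

/* Make room for at least total bytes of content plus the terminator. */
int abuf_reserve(abuf *b, size_t total) {
    assert(b != NULL);
    if (total == SIZE_MAX)
        return -1;
    if (total + 1 <= b->cap)
        return 0;                      /* already large enough */
    char *p = realloc(b->data, total + 1);
    if (p == NULL)
        return -1;
    b->data = p;
    b->cap = total + 1;
    b->data[b->len] = '\0';            /* also terminates a fresh buffer */
    b->grows++;
    return 0;
}

/* Append n bytes from src, doubling the capacity when space runs out. */
int abuf_append(abuf *b, const char *src, size_t n) {
    assert(b != NULL && src != NULL);
    if (n >= SIZE_MAX - b->len)
        return -1;
    size_t need = b->len + n + 1;      /* content plus terminator */
    if (need > b->cap) {
        size_t target = b->cap < 16 ? 16 : b->cap;
        while (target < need) {
            if (target > SIZE_MAX / 2) { target = need; break; }
            target *= 2;
        }
        if (abuf_reserve(b, target - 1) != 0)
            return -1;
    }
    memcpy(b->data + b->len, src, n);
    b->len += n;
    b->data[b->len] = '\0';
    return 0;
}
```

Calling `abuf_reserve` once with the final size before the append loop means `grows` stays at 1 for the whole run, whereas an unreserved buffer reallocates, and possibly copies, every time its capacity is exceeded.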
Pedro Vicente
  • The sds library is in C, while your table-string is in C++. Also, the reason sds seems so much slower is that your table is pre-allocated, while sds will re-allocate multiple times to grow. Try it again with sdsgrowzero(s, nbr*len), or the even faster sdsMakeRoomFor(), and see if your benchmarks hold up. – OlivierD Jul 10 '17 at 21:19
  • @Olivier. That is correct. I noticed when I wrote it that if I drop the number of allocations to, say, 100, the benchmark takes about the same time as std::string. I edited the post and added the benchmark to GitHub – Pedro Vicente Sep 08 '17 at 19:36