
For clarity I'm only talking about null terminated strings.

I'm familiar with the standard way of doing string comparisons in C with the usage of strcmp. But I feel like it's slow and inefficient.

I'm not necessarily looking for the easiest method but the most efficient.

Can the current comparison method (strcmp) be optimized further while the underlying code remains cross platform?

If strcmp can't be optimized further, what is the fastest way to perform the string comparison without strcmp?

Current use case:

  • Determine if two arbitrary strings match
  • Strings will not exceed 4096 bytes, nor be less than 1 byte in size
  • Strings are allocated/deallocated and compared within the same code/library
  • Once comparison is complete, I pass the string to another C library which requires a standard null terminated string
  • System memory limits are not a huge concern, but I will have tens of thousands of such strings queued up for comparison
  • Strings may contain high-ASCII or UTF-8 characters, but for my purposes I only need to know whether they match; the content itself is not a concern
  • Application runs on x86 but should also run on x64

Reference to current strcmp() implementation:

Edit: Clarified the solution does not need to be a modification of strcmp.

Edit 2: Added specific examples for this use case.

Joshua Briefman
  • Why do you think `strcmp()` is not sufficiently optimized? – e0k Jan 02 '17 at 01:06
  • I'm pretty sure `strcmp` is already optimized for whatever platform you're on. – Andy Schweig Jan 02 '17 at 01:07
  • I don't have a definitive answer, but I doubt it. You need to look at every single character of both strings in order to compare them, and I don't see any way around that. I bet strcmp() does a better job of it than "some guy" could do in an afternoon. – Erik Nyquist Jan 02 '17 at 01:07
  • @e0k I didn't say it wasn't "optimized". I said I feel it's slow and could be faster. – Joshua Briefman Jan 02 '17 at 01:11
  • I'd *prove* that via *real* benchmarks and disassembly analysis before looking to fix a wheel whose only evidence of disrepair is a "feeling". Most platforms have highly optimized intrinsics to perform tasks like this, provided you enable them with proper optimization configuration for your toolchain. – WhozCraig Jan 02 '17 at 01:15
  • @JoshuaBriefman, how is your "slow and could be faster" different from e0k's "not sufficiently optimized"? However you word it, there's good reason to think that the standard library's `strcmp()` implementation will be very efficient. You've presented no basis for thinking otherwise. – John Bollinger Jan 02 '17 at 01:15
  • @JohnBollinger I am seeing confusion in the comments between the use of "optimal" and "fastest". I am not arguing if strcmp is optimized or not, I am asking if it could be made faster. – Joshua Briefman Jan 02 '17 at 01:18
  • I edited my question to include a reference to the current implementation to help those who want to try and answer this question. – Joshua Briefman Jan 02 '17 at 01:21
  • @WhozCraig Please suggest those specific compiler/toolchain flags for a few platforms as your answer if this can result in a marked improvement over the default behavior. I'd love to use them. I'm not limited to any specific compiler, so you can use GCC/LLVM, or any other example. – Joshua Briefman Jan 02 '17 at 01:26
  • If the strings compared are either very short or are very long and very similar, you may indeed be able to implement a better comparison for that specific case. If you have to compare the same string with other strings many times, you should probably consider using hashing, so you can avoid examining the contents of that constant string over and over again. You need to tell us more about what you're doing, the context in which you find strcmp() insufficiently fast. – Alexey Frunze Jan 02 '17 at 01:35
  • @AlexeyFrunze Thanks, I'll add some more detail for my specific use case. – Joshua Briefman Jan 02 '17 at 01:37
  • @JoshuaBriefman - what is the source of these tens of thousands of strings? Are the strings read from a file? Are the strings all read into a single buffer or is each string separately allocated? – rcgldr Jan 02 '17 at 04:49
  • @rcgldr The strings come from both Twitter and news articles. So processing information quickly is critical; every millisecond matters. – Joshua Briefman Jan 03 '17 at 07:38

1 Answer


I'm afraid your reference implementation of strcmp() is both inaccurate and irrelevant:

  • It is inaccurate because it compares characters using the char type instead of the unsigned char type as specified in the C11 Standard:

    7.24.4 Comparison functions

    The sign of a nonzero value returned by the comparison functions memcmp, strcmp, and strncmp is determined by the sign of the difference between the values of the first pair of characters (both interpreted as unsigned char) that differ in the objects being compared.

  • It is irrelevant because the actual implementation used by modern compilers is much more sophisticated, expanded inline using hand-coded assembly language.

Any generic implementation is likely to be less efficient, especially if coded to remain portable across platforms.

Here are a few directions to explore if your program's bottleneck is comparing strings.

  • Analyze your algorithms and try to find ways to reduce the number of comparisons: for example, if you search for a string in an array, sorting that array and using a binary search will drastically reduce the number of comparisons (see the qsort()/bsearch() sketch after this list).
  • If your strings are tokens used in many different places, allocate unique copies of these tokens and use those as scalar values. The strings will be equal if and only if the pointers are equal. I use this trick in compilers and interpreters all the time with a hash table (see the interning sketch after this list).
  • If your strings have the same known length, you can use memcmp() instead of strcmp(). memcmp() is simpler than strcmp() and can be implemented even more efficiently in places where the strings are known to be properly aligned.
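
As a rough sketch of the first point, using the standard qsort() and bsearch() functions (the helper names cmp_str and find_str are made up for this illustration):

#include <stdlib.h>
#include <string.h>

/* Comparator for an array of char * (note the double indirection). */
static int cmp_str(const void *pa, const void *pb) {
    const char *a = *(const char *const *)pa;
    const char *b = *(const char *const *)pb;
    return strcmp(a, b);
}

/* Sort the array once with qsort(strings, count, sizeof(*strings), cmp_str);
   each lookup then costs O(log n) comparisons instead of O(n).
   Returns a pointer into the array, or NULL if needle is not present. */
const char **find_str(const char **strings, size_t count, const char *needle) {
    return bsearch(&needle, strings, count, sizeof(*strings), cmp_str);
}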
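
And a toy sketch of the second point: interning strings so that equality becomes a plain pointer comparison. A fixed-size linear table is used here only for brevity; a hash table, as mentioned above, is the realistic choice.

#include <stdlib.h>
#include <string.h>

#define MAX_INTERNED 1024              /* arbitrary limit for this toy example */
static const char *interned[MAX_INTERNED];
static size_t ninterned;

/* Return the canonical copy of s: equal strings always yield the same
   pointer, so later equality tests reduce to comparing two pointers. */
const char *intern(const char *s) {
    for (size_t i = 0; i < ninterned; i++)
        if (strcmp(interned[i], s) == 0)
            return interned[i];        /* already interned, reuse it */
    if (ninterned == MAX_INTERNED)
        return NULL;                   /* toy table is full */
    char *copy = malloc(strlen(s) + 1);
    if (copy == NULL)
        return NULL;                   /* out of memory */
    strcpy(copy, s);
    interned[ninterned++] = copy;
    return copy;
}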

EDIT: with the extra information provided, you could use a structure like this for your strings:

typedef struct string_t {
    size_t len;
    size_t hash;  // optional
    char str[];   // flexible array, use [1] for pre-C99 compilers
} string_t;

You allocate this structure this way:

string_t *create_str(const char *s) {
    size_t len = strlen(s);
    string_t *str = malloc(sizeof(*str) + len + 1);  // +1 for the terminating null byte
    str->len = len;
    str->hash = hash_str(s, len);
    memcpy(str->str, s, len + 1);
    return str;
}
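
hash_str() is left undefined above; any reasonable string hash will do. A minimal sketch using the 64-bit FNV-1a parameters (this particular choice is an assumption for the example, not part of the original answer):

#include <stddef.h>

/* Minimal FNV-1a hash matching the hash_str(s, len) call above. The
   constants are the 64-bit FNV parameters; substitute the 32-bit pair
   (2166136261 and 16777619) if size_t is 32 bits. */
size_t hash_str(const char *s, size_t len) {
    size_t h = (size_t)14695981039346656037u;   /* FNV offset basis */
    for (size_t i = 0; i < len; i++) {
        h ^= (unsigned char)s[i];
        h *= (size_t)1099511628211u;            /* FNV prime */
    }
    return h;
}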

If you can use these string_t objects for all your strings, you can greatly improve the efficiency of the matching by first comparing the lengths or the hashes. You can still pass the str member to your library function; it is properly null terminated.
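
For the comparison itself, a matcher along these lines would work (str_equal is a name made up for this example):

#include <string.h>

/* Cheap checks first: lengths, then hashes, then memcmp() only when both
   already agree. A differing hash proves a mismatch; equal hashes still
   require a byte-for-byte comparison because collisions are possible. */
int str_equal(const string_t *a, const string_t *b) {
    if (a->len != b->len)
        return 0;
    if (a->hash != b->hash)
        return 0;
    return memcmp(a->str, b->str, a->len) == 0;
}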

chqrlie
  • Are you saying the compiler doesn't actually use the standard C lib strcmp when it is used in an app's code? – Joshua Briefman Jan 02 '17 at 01:53
  • I do like the idea of memcmp() with fixed length strings, that could potentially help speed up the comparison. – Joshua Briefman Jan 02 '17 at 01:56
  • @JoshuaBriefman: The compiler generates code that implements the standard definition of `strcmp()`. It may issue calls to the C library `strcmp()` implementation or generate inline code that does not. – chqrlie Jan 02 '17 at 02:21
  • @rcgldr: a shorter string **is** considered smaller, but `memcmp()` should not be given strings whose length is less than its size argument minus one. `memcmp()` could dereference bytes beyond the end of these short strings and invoke undefined behavior on some systems. For example `memcmp(a, b, 8)` could read and compare 64 bit at a time. – chqrlie Jan 02 '17 at 02:24
  • @chqrlie That's a great example, I can see how using a structure to keep my strings would help improve the comparisons significantly. If I expected most strings wouldn't match then I could even start with a simple length comparison from the struct before initiating hash comparison, if the length doesn't match we can fail even faster. – Joshua Briefman Jan 02 '17 at 02:52
  • @JoshuaBriefman: indeed if your string lengths are sufficiently diverse, you can forget the hash. – chqrlie Jan 02 '17 at 02:59
  • @chqrlie: Hash functions are not free, surprisingly. In fact, hashing a string requires looking at every character, and may well take *significantly* more time than a string comparison, particularly if most comparisons are unequal. I'm not saying this is a bad idea, but it is appropriate only for certain use cases in which there are significantly more comparisons than strings (that is, every string is compared many times). If you were going to use this technique and you don't mind strings being immutable, you could intern the strings using a hash table, and then just compare addresses. – rici Jan 02 '17 at 03:54
  • @rici: I agree with you completely. These are directions for the OP to investigate. Depending on his actual string usage, computing the hash may help or not. Note that allocating the strings already has a cost somewhat comparable to hashing. Interning the strings in a hash table could be very effective as I mentioned in my second point. – chqrlie Jan 02 '17 at 04:02
  • @chqrlie: oops, missed that line, sorry. – rici Jan 02 '17 at 04:06
  • @rici: well I edited the line to mention the hash table, but this idea is my favorite approach to fast identifier lookups on small sets. – chqrlie Jan 02 '17 at 04:09
  • @chqrlie - I deleted my prior comment about using memcmp(). This would require all the strings be stored in one or more buffers, so that although memcmp() could go past the end of strings, it would not go past the end of the buffers that the strings are stored in. The OP has not stated how the strings are obtained or stored in memory. – rcgldr Jan 02 '17 at 04:51
  • Don't forget that it _is_ possible for two different strings to have the same hash, it is just extremely unlikely. You can quickly recognize that two strings with different hashes must be different, but to decide that two strings are the same, you must compare more than their hash values. – e0k Jan 03 '17 at 03:54
  • @e0k: of course the hash is used as a quick and efficient way to detect mismatches. If the hash values are identical and the lengths too, the strings must still be compared with `memcmp()`. – chqrlie Jan 03 '17 at 04:35