working with strings in c

Question

may someone please help me understand these lines of code in the program below this program according the writer it writes a string of hello world then there is a function in it that also reverses the string to world hello,my quest is what does this code do?

char * p_divs = divs; //what does divs do
    char tmp;
    while(tmp = *p_divs++)
        if (tmp == c) return 1

;

also this code in the void function

*dest = '\0';//what does this pointer do?
    int source_len = strlen(source); //what is source
    if (source_len == 0) return;
    char * p_source = source + source_len - 1;
    char * p_dest = dest;
    while(p_source >= source){
        while((p_source >= source) && (inDiv(*p_source, divs))) p_source--;

this is the main program

#include <stdio.h>
#include <string.h>

int inDiv(char c, char * divs){
    char * p_divs = divs;
    char tmp;
    while(tmp = *p_divs++)
        if (tmp == c) return 1;
    return 0;
}

void reverse(char * source, char * dest, char * divs){
    *dest = '\0';
    int source_len = strlen(source);
    if (source_len == 0) return;
    char * p_source = source + source_len - 1;
    char * p_dest = dest;
    while(p_source >= source){
        while((p_source >= source) && (inDiv(*p_source, divs))) p_source--;
        if (p_source < source) break;
        char * w_end = p_source;
        while((p_source >= source) && (!inDiv(*p_source, divs))) p_source--;
        char * w_beg = p_source + 1;
        for(char * p = w_beg; p <= w_end; p++) *p_dest++ = *p;
        *p_dest++ = ' ';
    }
    *p_dest = '\0';
}

#define MAS_SIZE 100

int main(){
    char source[MAS_SIZE], dest[MAS_SIZE], divs[MAS_SIZE];
    printf("String          : "); gets(source);
    printf("Dividers        : "); gets(divs);
    reverse(source, dest, divs);
    printf("Reversed string : %s", dest);
    return 0;  
}

"my quest is what does this code do?" Given that it uses `gets`, the answer is: it endangers your system by its very existence. At least IMO, asking somebody to analyze this code is like giving somebody a 5-gallon bucket of sewage, and asking where where it all came from. The code is an ugly mess. If you want to know how to do the tasks it supposedly does, you should ask that directly, so somebody can explain a decent way to do it. This code should be ignore except (perhaps) as an example of things to avoid. — Jerry Coffin, Apr 16 '12 at 05:42
@JerryCoffin: Wow, that's pretty harsh. I've seen a lot of far worse code. — Michael Burr, Apr 16 '12 at 05:54
@MichaelBurr: I've seen worse too, but to read words and print them in reverse, this is still a *lousy* starting point. It's *much* easier to understand the task by starting from a clean slate. — Jerry Coffin, Apr 16 '12 at 05:58
@JerryCoffin: I dunno - it basically starts from the end of the source string, locates words and copies them to the dest as it finds them - pretty straightforward, and overall how most C programs might approach the problem. There are certainly problems with the code, but other than the technical UB of decrementing a pointer beyond the start of the buffer (which a lot of production code gets away with, too), at first glance it appears to actually work and be relatively bug-free. — Michael Burr, Apr 16 '12 at 06:11
And as far as using `gets()` is concerned, clearly this is a throw away test-driver `main()` to demonstrate the `reverse()` function - one place where I think using `gets()` should still be acceptable (how bullet-proof does something like that need to be?). — Michael Burr, Apr 16 '12 at 06:15
@MichaelBurr: You must have a different definition of "production code" than I do if it includes decrementing before the beginning of an array. IMO, "sewage" is a nice description of that -- I tend to think of it as the specific type of sewage I contribute while sitting on the toilet. I'll repeat for emphasis: there are cleaner, easier ways to do this. Oh, and IMO, there's no excuse for using `gets` anywhere, ever. — Jerry Coffin, Apr 16 '12 at 06:20
By 'production code', I mean that I've seen it in code that's been released in a product. I don't mean that it's *acceptable* for production code. But, I'm sure you'd agree that virtually all production code of any significant size contains defects. — Michael Burr, Apr 16 '12 at 06:30
So what relevance does "production" have to anything then? I don't see how one piece of horrible code (published or otherwise) excuses another). — Jerry Coffin, Apr 16 '12 at 06:40
`gets` is no longer part of the C language, since C11 was introduced 4 months back. You should use `gets_s` instead, or if you don't have a C11 compiler, `fgets`. — Lundin, Apr 16 '12 at 06:57
@Jerry: that particular oversight could be corrected reasonably easily - why not suggest a fix instead of throwing out the baby with the bathwater? The code is otherwise par for the course for C. If you're going to criticise, offer an alteranative and we can see the concision/elegance/performance compromises you happen to prefer... there will be some. — Tony Delroy, Apr 16 '12 at 07:29
@TonyDelroy: Changing to `fgets` would be trivial, yes. Fixing the rest, much less so. If I had to suggest something better, I simply wouldn't use C -- it strikes me as a poor tool for the job. — Jerry Coffin, Apr 16 '12 at 08:17
@Jerry: and that's exactly what I thought when I first saw your comment - "he's used to C++ - of course this is going to look ugly" ;-) — Tony Delroy, Apr 16 '12 at 09:17
@TonyDelroy: While I can't call it a thing of great beauty, I think what I posted in my answer is a least a whole lot cleaner. — Jerry Coffin, Apr 16 '12 at 13:36

Tony Delroy · Accepted Answer · 2012-04-16T09:20:03.253

Here, inDiv can be called to search for the character c in the string divs, for example:

inDiv('x', "is there an x character in here somewhere?') will return 1
inDiv('x', "ahhh... not this time') will return 0

Working through it:

int inDiv(char c, char * divs)
{
    char * p_divs = divs;    // remember which character we're considering
    char tmp;
    while(tmp = *p_divs++)   // copy that character into tmp, and move p_divs to the next character
                             // but if tmp is then 0/false, break out of the while loop
         if (tmp == c) return 1;  // if tmp is the character we're searching for, return "1" meaning found
    return 0;   // must be here because tmp == 0 indicating end-of-string - return "0" meaning not-found
}

We can infer things about reverse by looking at the call site:

int main()
{
    char source[MAS_SIZE], dest[MAS_SIZE], divs[MAS_SIZE];
    printf("String          : ");
    gets(source);
    printf("Dividers        : ");
    gets(divs);
    reverse(source, dest, divs);
    printf("Reversed string : %s", dest);

We can see gets() called to read from standard input into character arrays source and divs -> those inputs are then provided to reverse(). The way dest is printed, it's clearly meant to be a destination for the reversal of the string in source. At this stage, there's no insight into the relevance of divs.

Let's look at the source...

void reverse(char * source, char * dest, char * divs)
{
    *dest = '\0'; //what does this pointer do?
    int source_len = strlen(source); //what is source
    if (source_len == 0) return;
    char* p_source = source + source_len - 1;
    char* p_dest = dest;
    while(p_source >= source)
    {
        while((p_source >= source) && (inDiv(*p_source, divs))) p_source--;

Here, *dest = '\0' writes a NUL character into the character array dest - that's the normal sentinel value encoding the end-of-string position - putting it in at the first character *dest implies we want the destination to be cleared out. We know source is the textual input that we'll be reversing - strlen() will set source_len to the number of characters therein. If there are no characters, then return as there's no work to do and the output is already terminated with NUL. Otherwise, a new pointer p_source is created and initialised to source + source_len - 1 -> that means it's pointing at the last non-NUL character in source. p_dest points at the NUL character at the start of the destination buffer.

Then the loop says: while (p_source >= source) - for this to do anything p_source must initially be >= source - that makes sense as p_source points at the last character and source is the first character address in the buffer; the comparison implies we'll be moving one or both towards the other until they would cross over - doing some work each time. Which brings us to:

while((p_source >= source) && (inDiv(*p_source, divs))) p_source--;

This is the same test we've just seen - but this time we're only moving p_source backwards towards the start of the string while inDiv(*p_source, divs) is also true... that means that the character at *p_source is one of the characters in the divs string. What it means is basically: move backwards until you've gone past the start of the string (though this test has undefined behaviour as Michael Burr points out in comments, and really might not work if the string happens to be allocated at address 0 - even if relative to some specific data segment, as the pointer could go from 0 to something like FFFFFFFF hex without ever seeming to be less than 0) or until you find a character that's not one of the "divider" characters.

Here we get some real insight into what the code's doing... dividing the input into "words" separated by any of a set of characters in the divs input, then writing them in reverse order with space delimiters into the destination buffer. That's getting ahead of ourselves a bit - but let's track it through:

The next line is...

if (p_source < source) break;

...which means if the loop exited having backed past the front of the source string, then break out of all the loops (looking ahead, we see the code just puts a new NUL on the end of the already-generated output and returns - but is that what we'd expect? - if we'd been backing through the "hello" in "hello world" then we'd hit the start of the string and terminate the loop without copying that last "hello" word to the output! The output will always be all the words in the input - except the first word - reversed - that's not the behaviour described by the author).

Otherwise:

char* w_end = p_source;  // remember where the non-divider character "word" ends

// move backwards until there are no more characters (p_source < source) or you find a non-divider character
while((p_source >= source) && (!inDiv(*p_source, divs))) p_source--;

// either way that loop exited, the "word" begins at p_source + 1
char * w_beg = p_source + 1;

// append the word between w_beg and w_end to the destination buffer
for(char* p = w_beg; p <= w_end; p++) *p_dest++ = *p;

// also add a space...
*p_dest++ = ' ';

This keeps happening for each "word" in the input, then the final line adds a NUL terminator to the destination.

*p_dest = '\0';

Now, you said:

according [to] the writer it writes a string of hello world then there is a function in it that also reverses the string to world hello

Well, given inputs "hello world" and divider characters including a space (but none of the other characters in the input), then the output would be "hello world " (note the space at the end).

For what it's worth - this code isn't that bad... it's pretty normal for C handling of ASCIIZ buffers, though the assumptions about the length of the input are dangerous and it's missing that first word....

** How to fix the undefined behaviour **

Regarding the undefined behaviour - the smallest change to address that is to change the loops so they terminate when at the start of the buffer, and have the next line explicitly check why it terminated and work out what behaviour is required. That will be a bit ugly, but isn't rocket science....

So if depending on undefined behavior (`if (p_source < source) break;`) isn't "that bad", what exactly *would* it take to be "that bad"? — Jerry Coffin, Apr 16 '12 at 08:16
That particular bit is totally bad - no question... I simply mean that - if that were fixed - the code overall is reasonably readable, concise without being overtly cryptic, maintainable and efficient. — Tony Delroy, Apr 16 '12 at 09:16

score 3 · Answer 2 · edited May 23 '17 at 12:04

char * p_divs = divs; //what does divs do
char tmp;
while(tmp = *p_divs++)
    if (tmp == c) return 1

divs is a pointer to a char array (certainly a string). p_divs just points to the same string and within the while loop a single character is extraced and written to tmp, and then the pointer is incremented meaning that the next character will be extraced on the next iterator. If tmp matches c the function returns.

Edit: You should learn more about pointers, have a look at Pointer Arithmetic.

Jerry Coffin · Answer 3 · 2012-04-17T03:53:09.037

As I pointed out in the comments, I don't think C is really the ideal tool for this task (given a choice, I'd use C++ without a second thought).

However, I suppose if I'm going to talk about how horrible the code is, the counter-comment really was right: I should post something better. Contrary to the comment in question, however, I don't think this represents a compromise in elegance, concision, or performance.

The only part that might be open to real argument is elegance, but think this is enough simpler and more straightforward that there's little real question in that respect. It's clearly more concise -- using roughly the same formatting convention as the original, my rev_words is 14 lines long instead of 17. As most people would format them, mine is 17 lines and his is 21.

For performance, I'd expect the two to be about equivalent under most circumstances. Mine avoids running off the beginning of the array, which saves a tiny bit of time. The original contains an early exit, which will save a tiny bit of time on reversing an empty string. I'd consider both insignificant though.

I think one more point is far more important though: I'm reasonably certain mine doesn't use/invoke/depend upon undefined behavior like the original does. I suppose some people might consider that justified if it provided a huge advantage in another area, but given that it's roughly tied or inferior in the other areas, I can't imagine who anybody would consider it (even close to) justified in this case.

#include <string.h>
#include <stdlib.h>

#include <stdio.h>

int contains(char const *input, char val) {
    while (*input != val && *input != '\0')
        ++input;
    return *input == val;
}

void rev_words(char *dest, size_t max_len, char const *input, char const *delims) {
    char const *end = input + strlen(input);
    char const *start;
    char const *pos;

    do {
        for (; end>input && contains(delims, end[-1]); --end);
        for (start=end; start>input && !contains(delims,start[-1]); --start);
        for (pos=start; pos<end && max_len>1; --max_len) 
            *dest++=*pos++;
        if (max_len > 1) { --max_len; *dest++ = ' '; }
        end=start;
    } while (max_len > 1 && start > input);
    *dest++ = '\0';
}

int main(){ 
    char reversed[100];

    rev_words(reversed, sizeof(reversed), "This is an\tinput\nstring with\tseveral words in\n     it.", " \t\n.");
    printf("%s\n", reversed);
    return 0;
}

Edit: The:

if (max_len > 1) { --max_len; *dest++ = ' '; }

should really be:

if (max_len > 1 && end-start > 0) { --max_len; *dest++ = ' '; }

If you want to allow for max_len < 1, you can change:

*dest++ = '\0';

to:

if (max_len > 0) *dest++ = '\0';

If the buffer length could somehow be set by via input from a (possibly hostile) user, that would probably be worthwhile. For many purposes it's sufficient to simply require a positive buffer size.

I hope that your discussion about the UB in the original code being 'justified' wasn't done with my comments in mind. My comments were not intended to say that the UB in the original was OK as is - just that it didn't make the code as posted deserving of being described a 'bucket of sewage'. The UB behavior in the original could be fixed with minor edits to several lines without a major restructuring. I just considered that a harsh description of the posted code; I didn't say the code was a model that did not need improvement. — Michael Burr, Apr 16 '12 at 15:13
They were mostly in reaction to Tony's claim that the original code was "not that bad". While the code can be hacked to prevent the UB, it's likely to further reduce the readability of code that's none too great to start with. Better to fix the root problem. — Jerry Coffin, Apr 16 '12 at 15:35
@Jerry: you know - from my perspective your solution's pretty similar to the original and if anything validates the general style and approach :-). Devil's in the detail. Your reversal yields two trailing spaces when the input starts with a "delims" character and has at least one token. You write a NUL when max_len is 0! ;-) But, let's get back to C++ and call it a truce :-). (Separately, I'd be tempted to suggest a single pass of delims populating an array indexed by character code - allowing O(1) delimiter detection - but that's less scalable to >8-bit characters). +1 anyway! — Tony Delroy, Apr 17 '12 at 02:20
I intentionally followed the original as closely as reasonable, including always writing the NUL (requiring size>0 seemed at least reasonably realistic). The doubled trailing space was an oversight though. I think enough delimiters to make O(1) detection worthwhile would be quite unusual. — Jerry Coffin, Apr 17 '12 at 03:46

working with strings in c

3 Answers3