Finding the shortest word in a C string

Question

I'm trying to find the shortest word in a given C string, but somehow it fails if there is only one word or if it's the last word.

I tried start at the null character and count backwards until I hit " " and than count the steps I took but it did not work properly.

Also is there a better or faster way to iterate through a string while finding the shortest or longest word?

#include <sys/types.h>
#include <string.h>
#include <stdio.h>

ssize_t find_short(const char *s) {
    int min = 100; 
    int count = 0; 

    for (int i = 0 ; i < strlen(s); i++) {
        if (s[i] != ' ') {
            count++;   
        } else {
            if (count < min) {
                min = count; 
            }
            count = 0; 
        }
    
        if (s[i] == '\0') {
            count = 0;
            while (s[i] != ' ') {
                --i; 
                count++; 
                if (s[i] == ' ' && count < min) {
                    min = count;  
                }
            }
        }
    }
    return min; 
}

dreamcrash · Answer 1 · 2021-01-04T17:49:01.030

Your idea was correct you just complicated a little. Let us break down your code:

int min = 100;

First you should initialized min to INT_MAX which you can get it from #include <limits.h>. Maybe all words are bigger than 100.

for (int i = 0 ; i < strlen(s); i++)

you can use the C-String terminal character '\0':

 for(int i = 0; s[i] != '\0'; i++)

The if and else part:

if (s[i] != ' ') 
     {
        count++;   
     }
     
    else
    {
      if (count < min)
       {
          min = count; 
       }
      count = 0; 
    }

almost correct, but you need to append count != 0 to the condition count < min to ensure that if the string starts with spaces, you do not want to count them as the smallest word.

This part can be removed :

 if (s[i] == '\0')
    { 
      count = 0;
       while(s[i] != ' ')
       {
         --i; 
         count++; 
         if(s[i] == ' ' && count < min) 
         {
           min = count;  
         }
        
       }
       
    }

check the last word outside the loop. Hence, your code would look like the following:

ssize_t find_short(const char *s)
{
    int min = INT_MAX;
    int count = 0; 
    // Iterate over the string  
    for(int i = 0; s[i] != '\0'; i++){
        if(s[i] != ' '){ // Increment the size of the word find so far
            count++;
        }
        else{ // There is a space meaning new word
            if(count != 0 && count < min){  // Check if current word is the smallest 
                min = count; // Update the counter
            }
            count = 0; // Set the counter 
        }
    } 
    return (count < min) ? count : min // Check the size of the last word
 }

Thank u for your efforts in explaining everything step by step and therefore making this very easy to understand. This helps me a lot! — rndnewbiee22, Jan 04 '21 at 17:50

chqrlie · Accepted Answer · 2021-01-05T00:04:12.770

There are multiple issues in your code:

mixing int, size_t and ssize_t for the index and string lengths and return value is confusing and incorrect as these types have a different range.
int min = 100; produces an incorrect return value if the shortest word is longer than that.
for (int i = 0 ; i < strlen(s); i++) is potentially very inefficient as the string length may be recomputed at every iteration. for (size_t i = 0 ; s[i] != '\0'; i++) is a better alternative.
find_short returns 100 for the empty string instead of 0.

Scanning backwards from the end is tricky and not necessary: to avoid this special case, omit the test in the for loop and detect the end of word by comparing the character with space or the null byte, breaking from the loop in the latter case after potentially updating the minimum length.

The initial value for min should be 0 to account for the case where the string is empty or contains only whitespace. Whenever a word has been found, min should be updated if it is 0 or if the word length is non zero and less than min.

Here are an implementation using <ctype.h> to test for whitespace:

#include <stddef.h>
#include <ctype.h>

size_t find_short(const char *s) {
    size_t min = 0, len = 0;

    for (;;) {
        unsigned char c = *s++;
        if (isspace(c) || c == '\0') {
            if (min == 0 || (len > 0 && len < min))
                min = len;
            if (c == '\0')  // test for loop termination
                break;
            len = 0;
        } else {
            len++;   
        }
    }
    return min; 
}

Here is a more general alternative using the functions strcspn() and strspn() from <string.h> where you can define the set of word separators:

#include <string.h>

size_t find_short(const char *s) {
    size_t min = 0;
    const char *seps = " \t\r\n";  // you could add dashes and punctuation

    while (*s) {
        s += strspn(s, seps);
        if (*s) {
            size_t len = strcspn(s, seps);
            if (min == 0 || (len > 0 && len < min))
                min = len;
            s += len;
        }
    }
    return min; 
}

anotherOne · Answer 3 · 2021-01-05T10:26:52.060

1

As you said your program fails if the shortest word is the last one. That's because the last i for which the loop runs is i == len-1 that is the last letter of the string. So in this last lap, count will be incremented, but you will never check if this count of the last word was smaller than the min you had so far.

Assuming that you receive a null-terminated string, you could extend the loop till i <= len (where len = strlen(s)) and adjust the if condition to

if( s[i] != ' ' && s[i] )

That means: if s[i] is not a space nor the terminating null character

Also you can remove the condition if (s[i] == '\0').

About faster algorithms, I don't think it's possible to do better.

If you want you can automate the count increment using an inner empty for loop running till it finds a space and then in the outer for check for how long the innermost have been running.

I once wrote a program for the same problem which uses an inner for, I show you just for the algorithm, but don't take example from the "style" of the code, I was trying to make it as few lines as possible and that's not a good practice.

ssize_t find_short(const char *s)
{
    ssize_t min = 99, i = 0;
  
    for( --s; !i || (min > 1 && *s); s += i) {
        for(i = 1; *(s+i) != ' ' && *(s+i); i++);
        if( min > i-1 ) min = i-1;
    }
  
    return min;
}

Oh, one improvement I just noticed in my code could be to return the min when it reaches 1 because you know you are not going to find shorter words.

edited Jan 05 '21 at 10:26

answered Jan 04 '21 at 18:52

anotherOne

1,513
11
20

`min = 99` ? What if the shortest word is longer than that? – chqrlie Jan 04 '21 at 18:57
In that case the function would return 99 – anotherOne Jan 04 '21 at 19:03
this behavior is not specified and seems incorrect, just as returning `99` for the empty string. – chqrlie Jan 04 '21 at 19:47
You are right, I am not going to argue on that. However there's no language that has a word so long to be close to 99 characters and I guess that an empty string would be a meaningless input about which the caller function should complain before to check the shortest word. But as I said you are right on pointing out these limitations. – anotherOne Jan 04 '21 at 21:22
A *word* is just defined as a sequence of characters delimited by spaces, what if the *words* are genomic sequences? Making assumptions about a maximum length is unnecessary. Returning `0` for the empty string or one that contains just whitespace is the only logical behavior. It unambiguously tells the caller that the string contains no *word*. – chqrlie Jan 04 '21 at 21:26
1

Well if the function is going to be used for that very particular *genomic* purpose, than it's worthy to use `INT_MAX`. For the the empty string if you really don't want to check in the caller function, you can always add a line in the called function doing the check before to even declare `min` etc... However when I wrote this piece of code I was aiming (just for fun) to write it in the fewest lines possible, and adding an header and a condition for the empty string would result in a 28.57% increase on the number of lines (initially just 7). I know, it's stupid to look at the number of lines. – anotherOne Jan 04 '21 at 21:42
I'm afraid the function has undefined behavior because it accesses `s[-1]`, it does not seem to handle multiple consecutive space characters correctly and it returns `0` if the string starts with a space :( – chqrlie Jan 04 '21 at 21:51
Skipping leading spaces would be an easy fix, but requires another line of code. About the `s[-1]` initially I was concerned as well, however it's part of an `&&` comparison that can only return either `1` or `0`, so after all the behaviour should be kind of defined, isn't it? – anotherOne Jan 04 '21 at 21:58
no the behavior is undefined as reading `s[-1]` may access an invalid address. The fix is trivial: use `!i || (min > 1 && *(s))` – chqrlie Jan 04 '21 at 22:00
Oh you are right, but in this way `!i` would be parsed every time, while in the other way only if `min <= 1 || !(*s)`. If `s[-1]` happens to be an invalid address the returned value of `&&` could be different from 1 or 0? – anotherOne Jan 04 '21 at 22:08
I know that questioning the undefined behaviour of `s[-1]` to avoid a condition to be parsed is mental. It's just that this `!i` would be always `false` (and so useless) except for the first run of the loop. I believed I found a way to force the first run without adding any weight to the rest of the loop, (and I also thought to have harnessed the mystic and dreaded *undefined behaviour*) but now you are telling me that I was deluded. – anotherOne Jan 04 '21 at 22:44
1

If `s[-1]` is an invalid address, you get undefined behavior, such as a segmentation fault, which terminates the program... – chqrlie Jan 04 '21 at 23:43
All right, thank you. I will not mess with the *undefined behaviour* ever again. – anotherOne Jan 05 '21 at 10:26

score 0 · Answer 4 · answered Jan 04 '21 at 17:15

0

I'd suggest that you use strtok to split your string into an array of strings using space character as the token. You can then process each string in the resulting array to determine which is the "word" that you want.

See Split strings into tokens and save them in an array

answered Jan 04 '21 at 17:15

ChrisBD

9,104
3
22
35

2

`strtok()` will modify the string, this is inappropriate. – chqrlie Jan 04 '21 at 18:58
And? You can always copy the string before using `strtok` if you need to keep a copy of the original, but I don't see that as a requirement in the original post. – ChrisBD Jan 04 '21 at 23:09
`ssize_t find_short(const char *s)` seems a pretty clear requirement that the array pointed to by `s` must not be modified... Splitting the string into an array of tokens, mapping `len()` on this array, then reducing with `min` is a typical approach in Python and Javascript: somewhat elegant one-liners with abysmal performance. – chqrlie Jan 04 '21 at 23:59
I didn't spot `const`. Stating that in your comment would have made it clearer for anyone else reading this. – ChrisBD Jan 05 '21 at 08:47

Finding the shortest word in a C string

4 Answers4