C finding the shortest substring of the string s1 that contains all letters from s2

Question

The user inputs the strings s1 and s2. Find the smallest substring of the string s1 that contains all the letters of s2. (if there are more than one of equal size, find the first one that appears)

Example input:

it is raining today
dot

Output: tod

Note: I wrote a working code, but it took me too much time to figure out and write, and since I'll have such examples on a test, that isn't good. Is there a simpler way?

How I did it: I wrote two functions, one that returns a substring from index i to index j for a given string, and another to check whether the substring contains all letters. Then I used them to find the shortest substring that contains all the letters by using nested for loops.

My code (working):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <limits.h>

const int LEN = 100;

char *subString(int beg, int end, int n, char str[n])
{
    int resultLen = abs(end - beg), i;
    char *result = malloc(sizeof(char) * resultLen);

    for (i = 0; i < resultLen; i++)
    {
        result[i] = str[i + beg - 1];
    }

    return result;
}

int doesSubstrContainAllLetters(int substrLen, int lettersLen, char substr[substrLen], char letters[lettersLen])
{
    int i, j, result = 1;
    char *containtChar;

    for (i = 0; i < lettersLen; i++)
    {
        containtChar = strchr(substr, letters[i]);

        if (containtChar == 0)
        {
            result = 0;
        }
        else
        {
            *containtChar = '+';
        }
    }

    return result;
}

int main()
{
    char s1[LEN], s2[LEN];

    gets(s1);
    gets(s2);

    int s1Len = strlen(s1);
    int s2Len = strlen(s2);
    int res, min_substrLen = INT_MAX, substrLen, i, j, c;

    for (i = 0; i < s1Len - 1; i++)
    {
        for (j = i + 1; j < s1Len; j++)
        {

            char *substr = subString(i, j, s1Len, s1);
            substrLen = strlen(substr);

            res = doesSubstrContainAllLetters(strlen(substr), s2Len, substr, s2);

            if (res == 1)
            {
                min_substrLen = (substrLen < min_substrLen) ? substrLen : min_substrLen;
            }
        }
    }

    for (i = 0; i < s1Len - 1; i++)
    {
        for (j = i + 1; j < s1Len; j++)
        {
            char *substr = subString(i, j, s1Len, s1);
            substrLen = strlen(substr);

            res = doesSubstrContainAllLetters(strlen(substr), s2Len, substr, s2);

            if (res == 1 && substrLen == min_substrLen)
            {
                char *substr = subString(i, j, s1Len, s1);
                printf("%s", substr);
                return 0;
            }
        }
    }

    return 0;
}

From an experienced programmer: this is not a trivial task, and it is probably ok to take some hours, depending on your level. I actually know some 'senior programmers' that wouldn't be able to complete it error-free. How long did you take? — Aganju, Jan 11 '21 at 23:30
It took me 1 hour to the minute Edit: but afaik what I wrote was very inefficient code, but I don't actually care about that on a test, only the time it takes to write — David Ilic, Jan 11 '21 at 23:31
My question is more of a "is there a function in C that does most of the job for me" or something. — David Ilic, Jan 11 '21 at 23:33
Improvement of working code is very much on-topic at [Code Review](https://codereview.stackexchange.com) exchange. — user58697, Jan 11 '21 at 23:49
If _needle_ was `"dott"`, should the _haystack_ need 1 or 2 `t`? — chux - Reinstate Monica, Jan 12 '21 at 00:09
Fast substring search is a big part of computer science and regular expressions. Read about Boyer Moore. And others! Maybe read https://stackoverflow.com/a/3183711/13422 — Zan Lynx, Jan 12 '21 at 00:17
Although I just noticed that your "needle" can appear in any order which makes it different. — Zan Lynx, Jan 12 '21 at 00:18
No, there isn't a function that does most of the job for you. A different approach to the problem is to use a histogram, and a sliding window. — user3386109, Jan 12 '21 at 00:22
@chux-ReinstateMonica: from the OP's code, if `s2` contains repeated characters, the substring from `s1` must also have them repeated to be a match. This may not be the actual specification, but that's what the implementation does. — chqrlie, Jan 12 '21 at 00:23

chqrlie · Answer 1 · 2021-01-12T00:52:12.770

Your code has some shortcomings:

you should not use gets().
you allocate a lot memory and never free it.
it does not work as expected if s2 contains one or more + characters

There is no function in the C library that does something close to what you need.

The problem is not so easy to solve, it took me 40 minutes to get an implementation that I believe is sturdy, albeit not very fast. It does not allocate memory, but assumes bytes have 8 bits.

Here is my code:

#include <stdio.h>
#include <string.h>

const char *minsubstr(const char *str, const char *set, size_t *plen) {
    size_t count[256] = { 0 }, cc[256];
    size_t i, n, set_len, best_len = -1;
    const char *best = NULL;
    for (i = 0; set[i]; i++) {
        unsigned char c = set[i];
        count[c]++;
    }
    set_len = i;
    if (set_len == 0) {
        best_len = 0;
        best = str;
    } else {
        for (; *str; str++) {
            if (count[(unsigned char)*str] == 0)
                continue;
            memcpy(cc, count, sizeof cc);
            for (i = n = 0; i < best_len && str[i]; i++) {
                unsigned char c = str[i];
                if (cc[c]) {
                    cc[c]--;
                    if (++n == set_len) {
                        if (best_len > i + 1) {
                            best_len = i + 1;
                            best = str;
                        }
                        break;
                    }
                }
            }
            if (!str[i]) {
                // no more matches
                break;
            }
        }
    }
    *plen = best_len;
    return best;
}

int main() {
    char s1[100], s2[100];
    const char *p;
    size_t len;

    if (fgets(s1, sizeof s1, stdin) && fgets(s2, sizeof s2, stdin)) {
        s1[strcspn(s1, "\n")] = '\0';  // strip the trailing newline
        s2[strcspn(s2, "\n")] = '\0';  // strip the trailing newline if any
        p = minsubstr(s1, s2, &len);
        if (p) {
            printf("%.*s\n", (int)len, p);
        } else {
            printf("no match\n");
        }
    }
    return 0;
}

Here is an alternative approach that does allocate some memory but should be faster for small s2 strings:

#include <stdlib.h>
#include <string.h>

const char *minsubstr(const char *str, const char *set, size_t *plen) {
    size_t i, len, set_len = strlen(set), best_len = -1;
    const char *best = NULL;
    if (set_len == 0) {
        best_len = 0;
        best = str;
    } else {
        char *buf = malloc(set_len);
        for (; *str; str++) {
            if (!memchr(set, *str, set_len))
                continue;
            memcpy(buf, set, len = set_len);
            for (i = 0; i < best_len && str[i]; i++) {
                char *p = memchr(buf, str[i], len);
                if (p != NULL) {
                    *p = buf[--len];
                    if (len == 0) {
                        if (best_len > i + 1) {
                            best_len = i + 1;
                            best = str;
                        }
                        break;
                    }
                }
            }
            if (!str[i]) {
                // no more matches
                break;
            }
        }
        free(buf);
    }
    *plen = best_len;
    return best;
}

Thanks. I'll analyze and learn from your code. I was aware of some of the shortcomings but I didn't really have time or care about them given the circumstances. — David Ilic, Jan 12 '21 at 00:33

user3386109 · Answer 2 · 2021-01-12T03:37:54.300

Here's a solution that uses histograms and a sliding window to find the best match. It assumes that only lower case letters are of interest. The histograms can be expanded to cover a different character set, if desired. It has no memory allocation, and runs in O(n) time. The first draft, which correctly identified "tod" as the correct output for the needle "dot" took me 31 minutes to write, including debug time.

#include <stdio.h>
#include <ctype.h>
#include <string.h>

char *findMinSubstring(char *haystack, char *needle, int *bestLength)
{
    int needleHistogram[26] = {0};
    int haystackHistogram[26] = {0};

    // create a histogram from the needle, keeping track of the number of non-zero entries in the histogram
    int count = 0;
    for (int i = 0; needle[i] != '\0'; i++)
        if (islower(needle[i]))
        {
            int c = needle[i] - 'a';
            needleHistogram[c]++;
            if (needleHistogram[c] == 1)
                count++;
        }

    // now look for the best substring using a sliding window
    int start = 0;
    int end = 0;
    int length = (int)strlen(haystack);
    int bestStart = -1;
    int bestEnd = length+1;
    for (;;)
    {
        if (end < length && count != 0)
        {
            // if the window doesn't contain all of the necessary letters, enlarge it by advancing the end
            if (islower(haystack[end]))
            {
                int c = haystack[end] - 'a';
                haystackHistogram[c]++;
                if (needleHistogram[c] > 0 && haystackHistogram[c] == needleHistogram[c])
                    count--;
            }
            end++;
        }
        else if (start < end && count == 0)
        {
            // if the window contains all of the necessary letters, shrink it by advancing the start
            if (islower(haystack[start]))
            {
                int c = haystack[start] - 'a';
                haystackHistogram[c]--;
                if (needleHistogram[c] > 0 && haystackHistogram[c] == needleHistogram[c]-1)
                    count++;
            }
            start++;
        }
        else
        {
            // if expanding or shrinking the window isn't an option, then we're done
            break;
        }

        // if the window contains all the necessary letters, and is smaller than the previous best, update the best
        if (count == 0 && (end - start) < (bestEnd - bestStart))
        {
            bestStart = start;
            bestEnd = end;
        }
    }

    if (bestStart >= 0 && bestEnd <= length)
    {
        // if a matching substring exists, return the length and a pointer to the beginning of the substring
        *bestLength = bestEnd - bestStart;
        return haystack + bestStart;
    }
    else
    {
        // failed, return NULL
        *bestLength = 0;
        return NULL;
    }
}

int main(void)
{
    char haystack[] = "it is raining today";
    char *needle[] = { "dot", "dott", "dotti", "it", "today", "i", "ii", "iii", "iiii", "iiiii", "y", "yy", "end", NULL };
    for (int i = 0; needle[i] != NULL; i++)
    {
        int bestLength = 0;
        char *bestString = findMinSubstring(haystack, needle[i], &bestLength);
        printf("%-5s ", needle[i]);
        if (bestString != NULL)
            printf("'%.*s'\n", bestLength, bestString);
        else
            printf(" No matching substring\n");
    }
}

main has a variety of test cases, including the test case from the question. The output from the program is:

dot   'tod'
dott  't is raining tod'
dotti 't is raining tod'
it    'it'
today 'today'
i     'i'
ii    'ini'
iii   'is raini'
iiii  'it is raini'
iiiii  No matching substring
y     'y'
yy     No matching substring
end    No matching substring

Nicely done. Rather than a lower case histogram, could simple use a `size_t needleHistogram[UCHAR_MAX+1]` — chux - Reinstate Monica, Jan 12 '21 at 03:04

score 0 · Answer 3 · answered Jan 12 '21 at 00:54

Perhaps avoid nested for loops and the O(n^2) time that those cost, by using sorted strings:

Lexicographically sort s2 (call this sorted string: s2S).
Tokenize s1 by splitting on whitespace.
Loop through tokens from s1.
Keep the original token (s1T) and a lexicographically-sorted version of the same token (s1S).
Look through s1S for a copy of s2S.
If there is a match, then you have a substring in s1T that contains the characters of s2.
Loop through each letter of s1T, until you find one of the starting characters in s2S. Compare the next character, and so on, until you have a substring or a mismatch. If all characters match, you have a candidate hit.

This reduces your time from your O(n^2) to O(nlogn), when using sorted strings. Looping through characters in strings is O(n).

C finding the shortest substring of the string s1 that contains all letters from s2

3 Answers3