2

I'm trying to optimize a problem that I have to make it more readable with the same speed optimization. My problem consists in this:

Allowed function: write.c, nothing else.

Write a program that takes two strings and displays, without doubles, the characters that appear in either one of the strings.

The display will be in the order characters appear in the command line, and will be followed by a \n.

As you can see, in the main it will take two of your argument strings (argv[1] and argv[2]) into our function (void remove_dup(char *str, char *str2) after is compiling it with GCC. That temporary array will hold the ASCII value of the character after a duplicate is detected. For example, str1 = "hello" and str2 = "laoblc". The expected output will result as "heloabc" using the write function.

However, GCC was complaining because I have an array subscript with my temporary character array filled in with zeroes from the index of my strings. To stop making the compiler complaint, I had to cast the string index as an int to hold the ASCII value inside my temporary array. This will be our checker, which will determine if there is a duplicate in our string depending on the value of the character. Recompiling it again, but this time using warning flags: gcc -Wextra -Werror -Wall remove_dup.c. This is the error that I get:

remove_dup:11 error: array subscript is of type 'char' [-Werror,-Wchar-subscripts]

           if (temp[str[i]] == 0)
                     ^~~~~~~

remove_dup:13 error: array subscript is of type 'char' [-Werror,-Wchar-subscripts]

                   temp[str[i]] = 1;
                        ^~~~~~~

remove_dup:21 error: array subscript is of type 'char' [-Werror,-Wchar-subscripts]

           if (temp[str2[i]]  == 0)
                   ^~~~~~~~

remove_dup.c:23 error: array subscript is of type 'char' [-Werror,-Wchar-subscripts]

                  temp[str2[i]] = 1;
                      ^~~~~~~~

Now my real question is, how can I have the same time efficiency BUT without using any kind of casting into my array? This program is running as O(m + n) where m is our first string and n is our second string.

This is the code:

void    remove_dup(char *str, char *str2)
{
    int temp[10000] = {0};
    int i;

    i = 0;
    while (str[i])
    {
        if (temp[(int)str[i]] == 0)
        {
            temp[(int)str[i]] = 1;
            write(1, &str[i], 1);
        }
        i++;
    }
    i = 0;
    while (str2[i])
    {
        if (temp[(int)str2[i]]  == 0)
        {
            temp[(int)str2[i]] = 1;
            write(1, &str2[i], 1);
        }
        i++;
    }
}

int main(int argc, char *argv[])
{
    if (argc == 3)
        remove_dup(argv[1], argv[2]);
    write(1, "\n", 1);
    return (0);
}

I hope this is clear enough with the logic structure I explained. I might have grammar mistakes, so bear with me :).

Zeid Tisnes
  • 396
  • 5
  • 22
  • 1
    AFAIK casting does not incur a runtime penalty. On a related note, `temp` only needs to be 256 elements long because that's how large `char` is. – Schwern Sep 17 '18 at 00:18
  • I agree with Schwen. Your code looks good. A slightly more elegant way to remove the annoying gcc warning is to replace `temp[(int)str[i]]` with `temp[+str[i]]`. This will also avoid the explicit cast; it is best to avoid casts whenever possible. – Joseph Quinsey Sep 17 '18 at 00:30
  • **Correction**: Unless your chars are `signed`! – Joseph Quinsey Sep 17 '18 at 00:33
  • @JosephQuinsey can you explain why chars must be signed in this case? – Zeid Tisnes Sep 17 '18 at 00:50

2 Answers2

2

Casting here will have no performance penalty.

However, as a rule of thumb, it is generally best to avoid explicit casts whenever possible. You can do this by for example by changing:

   temp[(int)str[i]]

to:

   temp[+str[i]]

This will work by the usual arithmetic conversions.

However, your code has another problem. You could ask: why would gcc bother to issue such an annoying warning message?

One answer is that they just like to be annoying. A better guess is that on most platforms char is signed-- see Is char signed or unsigned by default? --and so if your string happen to have an ASCII char greater than 127 (i.e. less than zero), you will have a bug.

One way to fix this is to replace:

   temp[(int)str[i]]

with:

   temp[str[i] + 128]

(and change int temp[10000] = {0} to int temp[256 + 128] = {0}). This will work regardless of the default sign of char.

Joseph Quinsey
  • 9,553
  • 10
  • 54
  • 77
  • As I was testing this, it will not complain. However, it will print both strings with the duplicates. if I make a change for `temp[str[i] + 128]` to `temp[+str[i]]` it will make it work. – Zeid Tisnes Sep 17 '18 at 01:38
  • Did you change the _two_ instances of `temp[(int)str[i]]` and the _two_ instances of `temp[(int)str2[i]]`? – Joseph Quinsey Sep 17 '18 at 01:41
  • Sorry if the above comment is obvious--I make dumb mistakes sometimes. – Joseph Quinsey Sep 17 '18 at 01:44
  • 1
    **Edit:** Regarding the OP's comment "This program is running as `O(m * n)` where `m` is our first string and `n` is our second string.", the time is in fact `O(m + n)`. – Joseph Quinsey Sep 17 '18 at 01:53
  • 1
    No worries, we all do to be honest. It works fine now after setting both instances to `temp[str[i] + 128]` or `temp[+str[i]]`. I thought to have one instance changed into `temp[+str[i]]` and the other to `temp[str[i] + 128]` will make the same asnwer for both. Edit: Yes, you are right in the `O(m + n)`. I will change it. – Zeid Tisnes Sep 17 '18 at 01:53
0

Now my real question is, how can I have the same time efficiency BUT without using any kind of casting into my array?

I don't believe casting in C has a runtime penalty. Everything in C is a number anyway. I believe it's just telling the compiler that yes, you know you're using the wrong type and believe it's ok.

Note that char can be signed. It is possible for a negative number to sneak in there.

This program is running as O(m * n) where m is our first string and n is our second string.

No, it's running as O(n). O(m*n) would be if you were iterating over one string for every character of the other.

for( int i = 0; i < strlen(str1); i++ ) {
    for( int j = 0; j < strlen(str2); j++ ) {
        ...
    }
}

But you're looping over each string one after the other in two independent loops. This is O(m + n) which is O(n).


On to improvements. First, temp only ever needs to hold the char range which is, at most, 256. Let's give it a variable name that describes what it does, chars_seen.

Finally, there's no need to store a full integer. Normally we'd use bool from stdbool.h, but we can define our own using signed char which is what stdbool.h is likely to do. We're sure to wrap it in an #ifndef bool so we use the system supplied one if available, it will know better than we do what type to use for a boolean.

#ifndef bool
  typedef signed char bool;
#endif
bool chars_seen[256] = {0};

You might be able to get a bit more performance by eliminating i and instead increment the pointer directly. Not only more performance, but this makes many string and array operations simpler.

for( ; *str != '\0'; str++ ) {
    if( !chars_seen[(size_t)*str] ) {
        chars_seen[(size_t)*str] = 1;
        write(1, str, 1);
    }
}

Note that I'm casting to size_t, not int, because that is the proper type for an index.

You might be able to shave a touch off by using post-increment, whether this helps is going to depend on your compiler.

    if( !chars_seen[(size_t)*str]++ ) {
        write(1, str, 1);
    }

Finally, to avoid repeating your code and to extend it to work with any number of strings, we can write a function which takes in the set of characters seen and displays one string. And we'll give the compiler the hint to inline it, though it's of questionable use.

inline void display_chars_no_dups( const char *str, bool chars_seen[]) {
    for( ; *str != '\0'; str++ ) {
        if( !chars_seen[(size_t)*str]++ ) {
            write(1, str, 1);
        }
    }
}

Then main allocates the array of seen characters and calls the function as many times as necessary.

int main(int argc, char *argv[]) {
    bool chars_seen[256] = {0};

    for( int i = 1; i < argc; i++ ) {
      display_chars_no_dups( argv[i], chars_seen );
    }
    write(1, "\n", 1);
}
Schwern
  • 153,029
  • 25
  • 195
  • 336