0

I'm trying to parse an http request. I want to get the hostname and port. I'm left with one of the "/" when using "//" as a delimiter and can't figure out why, as the code above works exactly the same (lines 57-59 vs. 65-69). I'm getting a hostname "www.example.com" (correct) for req1 but "/www.example.com" for req2. Shouldn't they be the same?

int parse_request(const char *request, char *method,
    char *hostname, char *port, char *uri) {
if (!is_complete_request(request)) {
    return 0;
}
char reqCpy[strlen(request)];
strcpy(reqCpy, request);
strcpy(method, strtok(reqCpy, " "));
strtok(NULL, "//");

char* url = strtok(NULL, " ");
printf("URL is %s\n", url);
if (strstr(url, ":") == NULL) {
    char newCpy[strlen(request)];
    strcpy(newCpy, request);
    strtok(newCpy, "//");
    char* name = strtok(NULL, "/");
    strcpy(hostname, name);

    strcpy(port, "80");
} else {
    char newCpy[strlen(request)];
    strcpy(newCpy, request);
    strtok(newCpy, "//");
    char* name = strtok(NULL, ":");
    strcpy(hostname, name);

    strcpy(port, strtok(NULL, "/"));
}
//printf("URI: %s", uri);

return 1;
}

req1 = "GET http://www.example.com/index.html HTTP/1.0\r\n" "Host: www.example.com\r\n" "User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0\r\n" "Accept-Language: en-US,en;q=0.5\r\n\r\n";

req2 = "GET http://www.example.com:8080/index.html?foo=1&bar=2 HTTP/1.0\r\n" "Host: www.example.com:8080\r\n" "User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0\r\n" "Accept-Language: en-US,en;q=0.5\r\n\r\n";

J Olson
  • 167
  • 2
  • 8
  • Output for req2 hostname should match req1 hostname -> "www.example.com" . Code is nearly identical, I don't understand why they might be different. – J Olson Mar 18 '21 at 02:13
  • 1
    `char reqCpy[strlen(request)]; strcpy(reqCpy, request);` That is a buffer overlfow as `reqCopy` does not have space for the string NUL terminator. Allocate one extra byte for the array. Same for the other `strcpy` calls. – kaylum Mar 18 '21 at 02:20
  • Good tip, didn't fix the issue though. Tried again with a significantly larger buffer size as well -- no luck. – J Olson Mar 18 '21 at 02:30
  • 1
    You need to provide exact, complete and minimal code that can reproduce the problem. See: [How to create a Minimal, Reproducible Example](https://stackoverflow.com/help/minimal-reproducible-example). Note that we don't want your full code. Just the minimum needed to repro the problem. For example, just hard code some inputs and call the function. As it is shown it is not clear what the function inputs are and the output shown doesn't even match any of the print statements in the code. – kaylum Mar 18 '21 at 02:40
  • 1
    `strtok(newCpy, "//");` does not use `//` as a delilmiter. The second argument to strtok is a set of characters, any one of which is a delimiter. So `strtok(newCpy, "//");` is semantically identical to `strtok(newCpy, "/");` (although it will probably be slower). – rici Mar 18 '21 at 03:44
  • We need to start using this canonical dupe more often: [How should character arrays be used as strings?](https://stackoverflow.com/questions/58526131/how-should-character-arrays-be-used-as-strings) However, it is not the only bug, so I'm not going to dupe hammer... – Lundin Mar 18 '21 at 16:06

2 Answers2

0

Various problems that lead to undefined behavior (UB) including

char reqCpy[strlen(request)];
strcpy(reqCpy, request);

reqCpy[] too small, by 1 to store a string of length strlen(request). Need 1 more.

char reqCpy[strlen(request) + 1];
strcpy(reqCpy, request);

At least fix that, else rest of code functionality is questionable.

chux - Reinstate Monica
  • 143,097
  • 13
  • 135
  • 256
0

That's the way strtok works. It's certainly not the most convenient interface in the world, but at least it is reasonably well-documented.

You seem to assume that strtok(…, "//"); is using the string "//" as a delimiter. But the second argument to strtok is not a single, possibly multicharacter delimiter. It's a set of possible single delimiter characters, any or all of which will be treated as delimiters. See the strtok manpage for details.

You'll find a lot of warnings about strtok floating around the web, many of them justified. But the "gotcha" in your code is one which doesn't get mentioned much, having to do with what happens when you call strtok twice with different delimiter sets.

To recap, strtok first skips delimiter characters, then skips non-delimiter characters, then overwrites the next delimiter character (if there is one) with a 0. It saves the address of the next character for a subsequent call.

That works as expected when the delimiter set doesn't change. And it works as expected on strings without repeated delimiters. But there is an oddity when the string has repeated delimiters and the delimiter set is changed for the second token. The unexpected result (in your case, a mysterious /) results from the fact that the first call to strtok only removes the first delimiter character following the token. If the token is followed by a sequence of delimiter characters, the next call to strtok needs to skip over the remainder of this sequence. Of course, it will do that if the delimiter set hasn't changed. But if you change the delimiter set for the second call, then what were considered delimiters from the first call are no longer consider delimiters in the second call, and they are not skipped, possibly contrary to expectations.

My guess is that strtok is not really the best standard library function for what you are trying to do. I'd suggest thinking about a combination of the following:

  • strstr to find multicharacter strings, like //.
  • strchr to find the next occurrence of a specific character.
  • strpbrk to find the next occurrence of one of a set of characters.
  • strspn and strcspn to find the length of a sequence of a set of characters (or the inverse of a set of characters).

Note that the return values of the first three functions differ from the return values of the last two. The first three functions give you a pointer to the substring/character being searched for, or NULL if the search fails. The last two functions count the number of characters up to a member of the set; if they don't find one, they just return the number of characters left in the string. (This behaviour is often easier to work with, which is why the functions were added to the C standard.)

Note: You also need to fix the buffer overruns identified by @chux. You must always be conscious of the need to allocate room for a string's terminating NUL (0). But if you rewrite the code to avoid using strtok, you might also find that you need fewer string copies, possibly even none.

rici
  • 234,347
  • 28
  • 237
  • 341