I'm writing a web server in C, and I'm struggling to handle non-ASCII characters correctly. Here is what I mean:

Suppose I enter this in the browser's address bar: localhost:8080/ñndex.html

I need to handle the path correctly. If I print the path as the server receives it, I get

%C3%B1ndex.html

That's fine: it is the UTF-8 encoding of ñ in percent-encoded form (only the bytes C3 and B1). The problem is how to convert it to something like

\xC3\xB1ndex.html

so that I can look up the file ñndex.html and send it to the client.

Craig Estey
MistBal
  • I voted to reopen because the questions are subtly different. The other one wants to decode an already parsed-out component and this one wants to get the path from the "path" component of an HTTP request. The loop termination is different and if you blindly decode then look for '?' your code has a subtle bug. – Joshua Aug 07 '21 at 20:58

1 Answer

Don't get too clever. The task is short enough that it doesn't need a library, but intricate enough that it has to be done carefully. That said, I'm not telling you to avoid a library: if you'd rather use one, use one. Library shopping is off topic, though.

Here's a routine that extracts the path (and only the path) from a URL.

I'm accustomed to passing arguments to server scripts still fully encoded and letting the script do the decoding, so to extract the script's argument portion (the query string), cut between the ? and the optional #. This is trivial. (The # really shouldn't be there, but I've seen dumb things before.)
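As a rough sketch of that cut, assuming the whole request target is available as a NUL-terminated string: the name get_query and its return conventions are hypothetical, not part of the routine in this answer. It copies the text between the ? and the optional # without decoding anything.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a malloc'd copy of the still-encoded query
 * string (the text between '?' and an optional '#'). Returns an empty
 * string when there is no '?', and NULL if malloc fails. */
char *get_query(const char *url)
{
    /* Find the first '?' or '#', or the end of the string. */
    const char *start = url + strcspn(url, "?#");
    size_t len = 0;

    if (*start == '?') {
        ++start;                    /* skip the '?' itself */
        len = strcspn(start, "#");  /* stop at the fragment or at the end */
    }

    char *query = malloc(len + 1);
    if (!query) return NULL;
    memcpy(query, start, len);
    query[len] = '\0';
    return query;
}
```

For example, get_query("/cgi/search?q=a%20b#top") would yield q=a%20b, still encoded, and the caller frees the result.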

#include <errno.h>
#include <stdlib.h>

/* Value of one hex digit, or -1 if c is not a hex digit. */
static int hexdigit(char c)
{
    return (c >= '0' && c <= '9')
        ? c - '0'
        : (c >= 'A' && c <= 'F')
            ? c - 'A' + 10
            : (c >= 'a' && c <= 'f')
               ? c - 'a' + 10
               : -1;
}

/* returns NULL on any error; check errno */
char *get_path(const char *url)
{
    size_t pathlen = 0;
    const char *s;
    /* Pass 1: validate every %XX escape and count the output bytes. */
    for (s = url; *s && *s != '?' && *s != '#'; ++s) {
        if (*s == '%') {
            if (hexdigit(s[1]) < 0 || hexdigit(s[2]) < 0) {
                errno = EINVAL;
                return NULL;
            }
            s += 2;     /* an escape is three input bytes, one output byte */
        }
        ++pathlen;
    }
    char *path = malloc(pathlen + 1);
    if (!path) return NULL;
    char *t = path;
    /* Pass 2: decode into the buffer; the input is known to be valid. */
    for (s = url; *s && *s != '?' && *s != '#'; ) {
        if (*s == '%') {
            *t++ = (char)((hexdigit(s[1]) << 4) + hexdigit(s[2]));
            s += 3;
        } else if (*s == '+') {
            /* Form-encoding convention; drop this branch if '+' must stay literal. */
            *t++ = ' ';
            ++s;
        } else {
            *t++ = *s++;
        }
    }
    *t = 0;
    return path;
}

This is the standard way of working in C: make two passes. The first pass validates the input, finds the end, and measures the output space required; the second pass generates the output.
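For contrast, here is a minimal single-pass sketch, assuming the request target sits in a buffer you are allowed to overwrite. It is an alternative technique, not the routine above: decoded output is never longer than the encoded input, so the bytes can be written back into the same buffer. The names hexval and decode_path_inplace are hypothetical, and unlike the routine above this sketch leaves + untouched.

```c
#include <stddef.h>

/* Hypothetical helper: value of one hex digit, or -1 if c is not one. */
static int hexval(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/* Hypothetical in-place variant: decodes the path part of buf, cutting it
 * at the first '?' or '#'. Returns the decoded length, or -1 on a
 * malformed escape (buf may be partially rewritten in that case). */
long decode_path_inplace(char *buf)
{
    char *s = buf;  /* read position */
    char *t = buf;  /* write position, never ahead of s */

    while (*s && *s != '?' && *s != '#') {
        if (*s == '%') {
            int hi = hexval(s[1]);
            int lo = hi < 0 ? -1 : hexval(s[2]);  /* don't read past a NUL */
            if (lo < 0) return -1;
            *t++ = (char)((hi << 4) | lo);
            s += 3;
        } else {
            *t++ = *s++;
        }
    }
    *t = '\0';
    return (long)(t - buf);
}
```

With the question's input, a buffer holding /%C3%B1ndex.html would end up holding the bytes of /ñndex.html in UTF-8. The trade-off is that a malformed escape is only noticed after part of the buffer has been rewritten, which is what validating in a separate first pass avoids.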

Joshua