
I'm currently writing a very simple web server to learn more about low level socket programming. More specifically, I'm using C++ as my main language and I am trying to encapsulate the low level C system calls inside C++ classes with a more high level API.

I have written a Socket class that manages a socket file descriptor and handles opening and closing using RAII. This class also exposes the standard socket operations for a connection oriented socket (TCP) such as bind, listen, accept, connect etc.

After reading the man pages for the send and recv system calls I realized that I needed to call these functions inside some form of loop in order to guarantee that all bytes are successfully sent/received.

My API for sending and receiving looks similar to this:

void SendBytes(const std::vector<std::uint8_t>& bytes) const;
void SendStr(const std::string& str) const;
std::vector<std::uint8_t> ReceiveBytes() const;
std::string ReceiveStr() const;

For the send functionality I decided to use a blocking send call inside a loop such as this (it is an internal helper function that works for both std::string and std::vector).

template<typename T>
void Send(const int fd, const T& bytes)
{
   using ValueType = typename T::value_type;
   using SizeType = typename T::size_type;

   const ValueType *const data{bytes.data()};
   SizeType bytesToSend{bytes.size()};
   SizeType bytesSent{0};
   while (bytesToSend > 0)
   {
      const ValueType *const buf{data + bytesSent};
      const ssize_t retVal{send(fd, buf, bytesToSend, 0)};
      if (retVal < 0)
      {
          throw ch::NetworkError{"Failed to send."};
      }
      const SizeType sent{static_cast<SizeType>(retVal)};
      bytesSent += sent;
      bytesToSend -= sent;
   }
}

This seems to work fine and guarantees that all bytes are sent once the member function returns without throwing an exception.

However, I started running into problems when I began implementing the receive functionality. For my first attempt I used a blocking recv call inside a loop and exited the loop if recv returned 0 indicating that the underlying TCP connection was closed.

template<typename T>
T Receive(const int fd)
{
   using SizeType = typename T::size_type;
   using ValueType = typename T::value_type;

   T result;

   const SizeType bufSize{1024};
   ValueType buf[bufSize];
   while (true)
   {
      const ssize_t retVal{recv(fd, buf, bufSize, 0)};
      if (retVal < 0)
      {
          throw ch::NetworkError{"Failed to receive."};
      }

      if (retVal == 0)
      {
          break; /* Connection is closed. */
      }

      const SizeType offset{static_cast<SizeType>(retVal)};
      result.insert(std::end(result), buf, buf + offset);
   }

   return result;
}

This works fine as long as the connection is closed by the sender after all bytes have been sent. However, this is not the case when using e.g. Chrome to request a webpage. The connection is kept open and my receive member function stays blocked on the recv system call after receiving all bytes in the request. I managed to get around this problem by setting a timeout on the recv call using setsockopt. Basically, I return all bytes received so far once the timeout expires. This feels like a very inelegant solution and I do not think that this is the way web servers handle this issue in reality.
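A minimal sketch of such a timeout, assuming a POSIX system. The helper name SetReceiveTimeout is hypothetical and only illustrates the setsockopt call with SO_RCVTIMEO; the original code for this workaround is not shown here.

```cpp
#include <sys/socket.h>
#include <sys/time.h>

#include <chrono>

// Hypothetical helper: after this call, a blocking recv on fd fails
// (returns -1 with errno set to EAGAIN or EWOULDBLOCK) once no data has
// arrived for the given duration. Returns true if the option was set.
bool SetReceiveTimeout(const int fd, const std::chrono::milliseconds timeout)
{
   timeval tv{};
   tv.tv_sec = static_cast<decltype(tv.tv_sec)>(timeout.count() / 1000);
   tv.tv_usec = static_cast<decltype(tv.tv_usec)>((timeout.count() % 1000) * 1000);
   return setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof tv) == 0;
}
```

With this in place, the receive loop has to treat a failed recv caused by the timeout differently from a genuine error, which is part of why the approach feels clumsy on its own.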

So, on to my question.

How does a web server know when an HTTP request has been fully received?

A GET request in HTTP 1.1 does not seem to include a Content-Length header. See e.g. this link.

François Andrieux
JonatanE
4 Answers


HTTP/1.1 is a text-based protocol, with binary POST data added in a somewhat hacky way. When writing a "receive loop" for HTTP, you cannot completely separate the data receiving part from the HTTP parsing part. This is because in HTTP, certain characters have special meaning. In particular, the CRLF (0x0D 0x0A) token is used to separate headers, and two CRLF tokens one after the other mark the end of the headers, which for a request without a body is also the end of the request.

So to stop receiving, you need to keep receiving data until one of the following happens:

  • Timeout – follow by sending a timeout response
  • Two CRLF in the request – follow by parsing the request, then respond as needed (parsed correctly? request makes sense? send data?)
  • Too much data – certain HTTP exploits aim to exhaust server resources like memory or processes (see e.g. slow loris)

And perhaps other edge cases. Also note that this only applies to requests without a body. For POST requests, you first wait for two CRLF tokens, then read Content-Length bytes in addition. And this is even more complicated when the client is using multipart encoding.
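As a rough illustration of the loop described above, here is a minimal sketch that keeps calling recv until the blank line arrives or a size limit is exceeded. The names ReceiveHeaderSection and kMaxHeaderBytes and the 8192-byte limit are assumptions made for this example. A timeout is assumed to be configured on the socket separately (for example with SO_RCVTIMEO), in which case recv fails and the function throws.

```cpp
#include <sys/socket.h>
#include <sys/types.h>

#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical limit that guards against oversized requests.
constexpr std::size_t kMaxHeaderBytes{8192};

// Reads from fd until the blank line ("\r\n\r\n") that ends the headers has
// arrived. Returns everything received so far, which may already contain
// the first bytes of a message body after the blank line.
std::string ReceiveHeaderSection(const int fd)
{
   const std::string terminator{"\r\n\r\n"};
   std::string received;
   std::size_t searchFrom{0};
   char buf[1024];
   while (true)
   {
      const ssize_t n{recv(fd, buf, sizeof buf, 0)};
      if (n < 0)
      {
         throw std::runtime_error{"Failed to receive (error or timeout)."};
      }
      if (n == 0)
      {
         throw std::runtime_error{"Connection closed before end of headers."};
      }
      received.append(buf, static_cast<std::size_t>(n));

      if (received.find(terminator, searchFrom) != std::string::npos)
      {
         return received;
      }
      if (received.size() > kMaxHeaderBytes)
      {
         throw std::runtime_error{"Header section too large."};
      }
      // The terminator may straddle two recv calls, so back up 3 bytes.
      searchFrom = received.size() >= 3 ? received.size() - 3 : 0;
   }
}
```

The returned string still has to be parsed to decide whether a body follows and how many more bytes to read.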

Aurel Bílý
  • Thank you for your detailed response! I already knew about the two sets of CRLFs being used to signal the end of the request, maybe I should've specified that in the question. The key takeaway from your answer was that I need to keep receiving data until I find this delimiter in the byte-stream or exit early based on some other criteria. Turns out that my time-out idea wasn't so far off after all. – JonatanE Jan 08 '19 at 17:37
  • The two `CRLF`s do not signal the end of the **request**, they signal the end of the **request headers** only. There MAY OR MAY NOT be a message body following the headers. You have to parse the headers to determine not only IF a body is present, but also in WHAT FORMAT it is being sent so that you read it correctly. The request ends at the end of the message body if one is present, otherwise at the end of the headers. HOW you determine the end of the body depends on its transfer format. – Remy Lebeau Jan 08 '19 at 20:24
  • @RemyLebeau Yes, I agree. I put "note that this only applies to requests without a body." In general, you determine what type (method) of request you are dealing with by parsing the headers, after receiving two `CRLF`. – Aurel Bílý Jan 08 '19 at 20:26

A request header is terminated by an empty line (two CRLFs with nothing between them).

So, when the server has received a request header, and then receives an empty line, and if the request was a GET (which has no payload), it knows the request is complete and can move on to dealing with forming a response. In other cases, it can move on to reading Content-Length worth of payload and act accordingly.

This is a reliable, well-defined property of the syntax.

No Content-Length is required or useful for a GET: the content is always zero-length. A hypothetical Header-Length is more like what you're asking about, but you'd have to parse the header first in order to find it, so it does not exist and we use this property of the syntax instead. As a result of this, though, you may consider adding an artificial timeout and maximum buffer size, on top of your normal parsing, to protect yourself from the occasional maliciously slow or long request.
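To make the "read Content-Length worth of payload" step concrete, here is a minimal sketch that works out how many body bytes are still missing once the empty line has been seen. ParseContentLength and RemainingBodyBytes are hypothetical helper names, and the parsing is deliberately simplistic: it assumes a single, well-formed Content-Length header and ignores other transfer formats.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>

// Returns the Content-Length value found in the header section, or 0 if the
// header is absent. Header names are case-insensitive, so the header section
// is lowercased before searching. Throws if the value is not a number.
std::size_t ParseContentLength(const std::string& received)
{
   std::string lower{received.substr(0, received.find("\r\n\r\n"))};
   std::transform(lower.begin(), lower.end(), lower.begin(),
                  [](unsigned char c) { return static_cast<char>(std::tolower(c)); });

   const std::string key{"\r\ncontent-length:"};
   const std::size_t pos{lower.find(key)};
   if (pos == std::string::npos)
   {
      return 0;
   }
   // std::stoull skips leading whitespace and stops at the first non-digit.
   return static_cast<std::size_t>(std::stoull(lower.substr(pos + key.size())));
}

// Given everything received so far (headers, empty line, and possibly part
// of the body), returns how many body bytes still have to be read.
std::size_t RemainingBodyBytes(const std::string& received)
{
   const std::size_t headerEnd{received.find("\r\n\r\n")};
   if (headerEnd == std::string::npos)
   {
      return 0; // Headers are not complete yet, so nothing is known.
   }
   const std::size_t bodyReceived{received.size() - (headerEnd + 4)};
   const std::size_t expected{ParseContentLength(received)};
   return expected > bodyReceived ? expected - bodyReceived : 0;
}
```

For a GET without a Content-Length header, RemainingBodyBytes returns 0 as soon as the empty line has arrived, which is exactly the "request is complete" case.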

Lightness Races in Orbit
  • Technically, Content length can be added to any request verb, including get. Technically, GET requests can include content. – Michael Chourdakis Jan 08 '19 at 14:46
  • @Michael It certainly can but it is without usefulness in that context. – Lightness Races in Orbit Jan 08 '19 at 14:46
  • @Remy Thanks for your edit. Every HTTP request works that way AFAIK. Are there some discrepancies? – Lightness Races in Orbit Jan 08 '19 at 23:40
  • @LightnessRacesinOrbit your original wording implied that *EVERY* HTTP request ends at the blank line following the headers, and that is simply not true. *MOST* HTTP requests have a message body after the headers (even if the body is 0 bytes). `GET` and `HEAD` requests end after the headers, as there is no body. Other requests end after the body instead. You have to analyze the headers of each request to know if a body is present, and how to read it. – Remy Lebeau Jan 08 '19 at 23:54
  • @RemyLebeau Okay, I guess I should have said the request _header_ terminates with an empty line. In my philosophy, the request is one thing, and a payload optionally follows, but YMMV. – Lightness Races in Orbit Jan 09 '19 at 00:46
  • @LightnessRacesinOrbit The request headers and request payload are separate pieces of a single message. The request is the whole message. Same with responses. – Remy Lebeau Jan 09 '19 at 01:49
  • @RemyLebeau Right, hence the "I should have said"! Anyway, I adjusted the wording now. – Lightness Races in Orbit Jan 09 '19 at 10:37

The solution is within your link

A GET request in HTTP 1.1 does not seem to include a Content-Length header. See e.g. this link.

There it says:

It must use CRLF line endings, and it must end in \r\n\r\n

urbanSoft

The answer is formally defined in the HTTP protocol specifications 1:

So, to summarize, the server first reads the message's initial start-line to determine the request type. If the HTTP version is 0.9, the request is done, as the only supported request is GET without any headers. Otherwise, the server reads the message-headers until a terminating CRLF is reached. Then, only if the request type has a defined message body, the server reads the body according to the transfer format outlined by the request headers (requests and responses are not restricted to using a Content-Length header in HTTP 1.1).

In the case of a GET request, there is no message body defined, so the message ends after the start-line in HTTP 0.9, and after the terminating CRLF of the message-headers in HTTP 1.0 and 1.1.
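The "transfer format" decision can be sketched as follows. This is a simplified illustration, not a complete implementation of the rules in the specs: BodyFraming, HeaderValueLower and ClassifyBody are hypothetical names, and the check only looks for the word "chunked" in Transfer-Encoding before falling back to Content-Length.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>

enum class BodyFraming { None, ContentLength, Chunked };

// Returns the lowercased value of the named header, or an empty string if
// the header is absent. lowerName must already be lowercase.
std::string HeaderValueLower(const std::string& message, const std::string& lowerName)
{
   // Only look at the header section, i.e. everything before CRLF CRLF.
   std::string lower{message.substr(0, message.find("\r\n\r\n"))};
   std::transform(lower.begin(), lower.end(), lower.begin(),
                  [](unsigned char c) { return static_cast<char>(std::tolower(c)); });

   const std::string key{"\r\n" + lowerName + ":"};
   const std::size_t pos{lower.find(key)};
   if (pos == std::string::npos)
   {
      return {};
   }
   const std::size_t start{pos + key.size()};
   const std::size_t end{lower.find("\r\n", start)};
   return lower.substr(start, end == std::string::npos ? end : end - start);
}

// Decides how the end of the message body is found. Transfer-Encoding is
// checked first, since it takes precedence over Content-Length.
BodyFraming ClassifyBody(const std::string& message)
{
   if (HeaderValueLower(message, "transfer-encoding").find("chunked") != std::string::npos)
   {
      return BodyFraming::Chunked;
   }
   if (!HeaderValueLower(message, "content-length").empty())
   {
      return BodyFraming::ContentLength;
   }
   return BodyFraming::None;
}
```

A typical GET request classifies as BodyFraming::None, so the message ends at the terminating CRLF of the headers.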

1: I'm not going to get into HTTP 2.0, which is a whole different ballgame.

Remy Lebeau
  • I think RFC 7230 sec. 3.3 is entirely sufficient to answer this question. Not sure why you saw the need to quote the (as you know) outdated RFC 2616, though. Upvote since this should be the accepted answer. – DaSourcerer Jan 08 '19 at 20:30
  • @DaSourcerer many webservers have not been updated to implement RFCs 7230...7235 yet, they still implement RFC 2616. Although RFCs 7230-7235 are *mostly* just a restructuring of RFC 2616 to break it up, they also do [make a number of changes](https://tools.ietf.org/html/rfc7230#appendix-A.2) to the protocol, too (like deprecating header folding, and expanding on how a message length is determined). That is why I mention both sets of RFCs for HTTP 1.1. – Remy Lebeau Jan 08 '19 at 20:49
  • Each header is separated by CRLF, how do you know which CRLF is a terminating one? – EntityinArray Apr 10 '20 at 08:07
  • @EntityinArray read the specs I linked to. Yes, each header is terminated by a `CRLF`, but then after the headers are finished, there is another lone `CRLF` by itself. In other words, the headers are terminated by a `CRLF CRLF` pair. – Remy Lebeau Apr 10 '20 at 09:25