
I am trying to understand the basics of the internals of HTTP servers and clients with regard to how they transmit data. I have read many articles about how HTTP works, but I haven't found any that answer some of my questions. I would like to go through the process of loading a web page as I understand it, and I would appreciate it if you pointed out where I got it wrong.

  1. When I visit a site, my browser asks a server for an HTML file. For that, my browser creates a socket, binds it to my IP address, and connects it to a listening socket of the server of the site I am visiting. In order to connect my browser's socket to the server I need a port number and a hostname; the port number is 80 because this is HTTP, and the hostname is obtained via DNS resolution. Now that there is a connection between sockets, my browser sends a GET request. That request is an ASCII file with the contents corresponding to an HTTP request. My browser writes the raw ASCII bytes to the socket, and they are written to the server's socket.

  2. The server writes back the HTML file I requested to the socket. The HTML the server sends is just an ASCII file that the server will write byte by byte to the socket.

  3. My browser receives the ASCII file and parses it. Let's assume here that it finds an image tag. The browser sends an HTTP request for that image file. Here comes something I don't understand. How does the server respond? As far as I can tell the server must send back an ASCII file formed by a set of headers followed by a CRLF and then the body of the message. In this case, assuming my browser asked for a .jpeg, does the server write the headers as ASCII plaintext to the socket and then write the raw bytes of the image to the socket?

  4. If the HTML file has several images do we open a socket per image (per request)?

  5. Let's assume that my browser now finds a JavaScript tag. When the server answers my request for that script, does the server write the ASCII bytes of the source of the script to the socket? What happens with JS libraries? Does the server have to send all the source code for each one?

  6. On writing data to the sockets: is write(2) the correct way to do all this writing between sockets?

  7. On the transmission of large files: if I click a button on the site that lets me download a large PDF, how is this accomplished by the server? I assume that the server tries to transmit this in pieces. As far as I can tell there is an option for chunked encoding. Is this the way? If it is, is the file divided into chunks, and these are appended to the ASCII response and written byte by byte into the socket?

Finally, how is video transmitted? I know video encoding and transmission would require entire books for a detailed explanation, but if you could say something about the generalities of video transmission (for example on YouTube) I would appreciate it.

Anything that you could say about HTTP on the socket level would be appreciated. Thanks.

Akshat Mahajan
jim79

3 Answers


All my answers below relate to HTTP/1.1, not HTTP/2:

3.- My browser receives the ASCII file and parses it. Let's assume here that it finds an image tag. The browser sends an HTTP request for that image file. Here comes something I don't understand. How does the server respond? As far as I can tell the server must send back an ASCII file formed by a set of headers followed by a CRLF and then the body of the message. In this case, assuming my browser asked for a .jpeg, does the server write the headers as ASCII plaintext to the socket and then write the raw bytes of the image to the socket?

Yes, it does, usually. The body may also be compressed (gzip or brotli, announced in the Content-Encoding header), or it might be sent chunked if a Content-Length was not set.
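As a rough illustration of "ASCII headers, then raw bytes", here is a minimal Python sketch. The helper name `build_response` and the stand-in image bytes are made up for the example; a real server would read the file from disk and write the result to the socket.

```python
def build_response(body: bytes, content_type: str) -> bytes:
    """Return a complete HTTP/1.1 response: ASCII headers, a blank line, raw body."""
    head = (
        "HTTP/1.1 200 OK\r\n"
        f"Content-Type: {content_type}\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"  # the empty line that separates headers from body
    ).encode("ascii")
    return head + body  # the body is appended untouched, byte for byte


# Stand-in for the contents of a .jpeg file (hypothetical data, not a real image).
fake_jpeg = b"\xff\xd8\xff\xe0" + bytes(range(16))

response = build_response(fake_jpeg, "image/jpeg")

# A client splits at the first blank line: text before it, raw bytes after it.
head, _, body = response.partition(b"\r\n\r\n")
```

The headers are the only part that has to be text; everything after the blank line is handed to the image decoder as-is.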

4.- If the HTML file has several images do we open a socket per image (per request)?

In HTTP/1.1, modern browsers will open up to 6 connections (sockets) per host, but not more. If there are more than 6 requests going to the same host, the rest wait until earlier responses have been received and a connection is free.

5.- Let's assume that my browser now finds a JavaScript tag. When the server answers my request for that script, does the server write the ASCII bytes of the source of the script to the socket? What happens with JS libraries? Does the server have to send all the source code for each one?

Usually yes, you need one HTTP request per JavaScript file. There are some server-side tools that combine JavaScript sources along with their dependencies into a single JavaScript 'file'. Note that JavaScript sources are typically UTF-8, not ASCII.

6.- On writing data to the sockets: is write(2) the correct way to do all this writing between sockets?

Dunno! Not a C guy

7.- On the transmission of large files: if I click a button on the site that lets me download a large PDF, how is this accomplished by the server? I assume that the server tries to transmit this in pieces. As far as I can tell there is an option for chunked encoding. Is this the way? If it is, is the file divided into chunks, and these are appended to the ASCII response and written byte by byte into the socket?

No, chunked encoding is used for HTTP responses whose content length is not known ahead of time. A large file of known size is sent with a Content-Length header, followed by the body. The 'splitting up' you're talking about is done at the TCP/IP level, not at the HTTP protocol level. From an HTTP perspective it's just one continuous stream.
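For reference, this is roughly what chunked transfer encoding looks like on the wire when it is used: each chunk is prefixed with its size in hexadecimal, and a zero-length chunk ends the body. A minimal Python sketch (the function name `encode_chunked` is made up for the example, and trailers and chunk extensions are left out):

```python
from typing import Iterable


def encode_chunked(pieces: Iterable[bytes]) -> bytes:
    """Encode body pieces using HTTP/1.1 chunked transfer encoding."""
    out = bytearray()
    for piece in pieces:
        if not piece:
            continue  # a zero-length chunk would prematurely signal end of body
        # <size in hex> CRLF <data> CRLF
        out += f"{len(piece):x}\r\n".encode("ascii") + piece + b"\r\n"
    out += b"0\r\n\r\n"  # last chunk (size 0) followed by an empty trailer section
    return bytes(out)


encoded = encode_chunked([b"Hello, ", b"world!"])
# encoded == b"7\r\nHello, \r\n6\r\nworld!\r\n0\r\n\r\n"
```

The response would carry `Transfer-Encoding: chunked` instead of `Content-Length`.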

Finally, how is video transmitted? I know video encoding and transmission would require entire books for a detailed explanation, but if you could say something about the generalities of video transmission (for example on YouTube) I would appreciate it.

Too broad for me to answer.

Evert

It is highly recommended to read High-Performance Browser Networking.

About HTTP

HTTP is a message structuring protocol. It can be built on top of TCP/IP, or UDP, or any other communication protocol.

IP solves the problem of figuring out which computer in a network a message is meant to get to, and TCP solves the problem of ensuring the message gets received, complete and in order, despite loss and noise along the way. UDP also carries messages between programs, but without TCP's guarantees, which saves time and makes it the better choice in some situations, such as live video streaming.

HTTP only solves the problem of what the messages should look like so everyone can understand what you mean. An HTTP message consists of a header and a body. The body is the message you want to send; the header contains meta-information about the status of the message itself. HTTP lets you structure your applications in a meaningful, context-oriented way through a standard set of terms.

For example, you can communicate character encodings of your body with HTTP, how long your content is, whether you are okay with receiving it in a compressed format, and so on and so forth. So, no, HTTP is not limited to ASCII texts - you can send UTF-8 encoded characters with BOM markings, or not even specify an encoding at all. All HTTP does is let you ask for things in the way you want it, and inform recipients how you've packaged a message.

The things actually responsible for how your messages are sent, rather than structured, are TCP/IP and UDP. HTTP has nothing to do with it. Both TCP/IP and UDP add overhead, but it is well worth it so that communication can pass through unimpeded.

About Sockets

Computers communicate through "sockets", which is just a fancy name for one end of a communication channel. It does not much matter what a socket is underneath - the operating system provides it, and the bytes may ultimately travel over a wire or a wireless radio. All that matters is what a socket can do. Programs can write bytes to a socket and read bytes that arrive on it. Every socket has memory reserved for outgoing and incoming bytes (like an outbox and an inbox) called buffers, so several small writes can be bundled and sent together in one shot to save time; forcing buffered bytes out is called flushing.

Sockets at the hardware level usually devolve to a network card, which lets you talk to a wireless network or over an Ethernet cable. Note that the computer may have many more sockets than cables - this is because a socket is a generic name for a single communication channel, and a single network/Ethernet card can handle multiple communication channels. Being able to handle multiple channels at once is called multiplexing.

TCP/IP and UDP are just blueprints - it is the responsibility of the operating system to actually do as they lay out, and most OSs have some program designed to implement these standards. At the software level, how information is read and written becomes slightly more complicated than just passing bytes since a computer must also be able to interrupt its running programs when a hardware event happens, including while communicating from a socket - here is a reference for how the Linux kernel implements TCP/IP.

All operating systems expose a set of calls to give a socket an address (bind), start listening on it (listen), read from a socket and write to a socket. You can read from a socket in multiple ways, however. These range from a basic blocking read(), which forces the program to wait until data has been received, to select() and poll() on most Unix-like systems and epoll on Linux, which enable a program to ask to be notified when data has been received on any of many sockets before having to read it.
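To make "ask to be notified when data has been received" concrete, here is a small Python sketch using the standard `selectors` module, which picks epoll, poll or select depending on the platform. It uses a local `socket.socketpair()` instead of a real network connection, purely so the example is self-contained.

```python
import selectors
import socket

# Two connected sockets on the same machine stand in for client and server.
a, b = socket.socketpair()

sel = selectors.DefaultSelector()      # epoll/kqueue/poll/select, whichever is available
sel.register(b, selectors.EVENT_READ)  # "tell me when b has something to read"

ready_before = sel.select(timeout=0)   # nothing has been sent yet, so nothing is ready

a.sendall(b"ping")
ready_after = sel.select(timeout=1)    # now b is reported as readable
received = b.recv(1024) if ready_after else b""

sel.close()
a.close()
b.close()
```

A real server registers its listening socket and every client socket with the same selector, and only reads from the ones reported as ready.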

Windows exports a completely different set of system calls, so you would be well advised to consult a reference manual for the same if planning to build applications for Windows.

About TCP/IP

TCP/IP is a combination of two protocols that has mostly become the norm for ensuring reliable communication.

IP is responsible for the term IP address. Every computer has a unique address associated with it, specified as either a 32-bit number (IPv4) or a 128-bit number (IPv6, or IP version 6). Note that these addresses do not exist outside of a network: a network is just a collection of computers, and a computer's address only makes sense within that collection. The network that the computer comes from is part of the IP address of a computer; the network itself is given a unique address; and a network may be composed of multiple networks. TCP and UDP, which sit on top of IP, introduce the concept of a port: a number that identifies one program or service on a computer. A socket for these protocols is identified by an IP address together with a port.

I'm just tossing about the term 'network' willy-nilly as an abstract concept, but physically it boils down to a router. A router is a special computer responsible for figuring out who is being referred to in a message using the IP address attached to the message, for assigning IP addresses to computers it is aware of (a network is quite literally the set of computers the router knows about), and for forwarding messages to other computers or routers. An internetwork (or just the Internet) is simply a bunch of routers, each with their own network, able to communicate with each other to form one giant network of connected networks. Effectively, a router implements the IP standard.

TCP and UDP are designed to solve another harrowing problem: how to ensure all of your messages get through. Sending any message down a shared communication channel like wireless or even wired channels organised like a bus topology is inherently messy - different messages can overlap, messages can be lost unexpectedly, messages can be corrupted and so on. TCP aims to solve these problems by guaranteeing all of a message goes through. On the other hand, UDP makes no such guarantees, and thus saves time by skipping a lot of steps TCP does.

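TCP splits a message into packets (called segments) of a limited size, so that the pieces can travel independently and be resent individually; with UDP, each message handed over by the application is sent as its own packet (a datagram). Before any data is sent, TCP adds some additional structure to the exchange called a three-way handshake: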

  • It sends off a TCP-specific message called a SYN packet to the computer it wants to send a message to, and waits for a response.
  • If the target computer receives it, it responds with a SYN ACK packet. On receiving this, the source computer responds with an ACK packet. This lets both computers know each other is listening, and they can start sending packets.
  • On the other hand, if either the source or target computer doesn't hear anything after a while, it sends again and waits some more. Every time it has to wait, it waits for twice as long as it did last time, until a maximum wait period has been reached and it aborts the connection. This is called exponential backoff, and is key to TCP.
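The doubling wait described above can be sketched in a few lines of Python. The starting delay, cap and number of attempts below are illustrative assumptions, not the timer values any real TCP implementation uses.

```python
from typing import List


def backoff_delays(initial: float, max_delay: float, attempts: int) -> List[float]:
    """Return the wait before each retry: doubled every time, capped at max_delay."""
    delays = []
    delay = initial
    for _ in range(attempts):
        delays.append(min(delay, max_delay))
        delay *= 2  # wait twice as long as last time
    return delays


# Assumed values for illustration: start at 1 s, never wait more than 60 s.
schedule = backoff_delays(initial=1.0, max_delay=60.0, attempts=8)
# schedule == [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 60.0]
```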

A three-way handshake ensures everyone is ready and willing to listen. However, the fun doesn't stop there:

  • As part of the handshake, each computer tells the other the largest packet it can accept and how much data it is willing to buffer at once (its receive window).
  • After the handshake, the source computer fires off an initial batch of packets, and waits for ACKs covering what it sent. If a packet is not acknowledged in time, the source resends it, waiting longer each time through exponential backoff.
  • Packets may arrive out of order, depending on how the intervening networks' routers chose to optimise the path for each packet, so each packet carries a sequence number indicating its position, and the target computer uses it to sort the packets into one neat message.
  • As ACKs come in, the source uses the time taken to decide how much it can send next. The better the response time, the more packets TCP is willing to send.
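The reordering step can be illustrated with a toy Python sketch. It assumes each packet is a (sequence number, payload) pair where the sequence number is the byte offset of the payload, which is a simplification of what TCP actually tracks; the function name `reassemble` is made up for the example.

```python
from typing import Dict, Iterable, Tuple


def reassemble(segments: Iterable[Tuple[int, bytes]]) -> bytes:
    """Rebuild a message from (sequence_number, payload) pairs in any order."""
    by_seq: Dict[int, bytes] = {}
    for seq, payload in segments:
        by_seq.setdefault(seq, payload)  # ignore retransmitted duplicates
    # Sort by sequence number and glue the payloads back together.
    return b"".join(by_seq[seq] for seq in sorted(by_seq))


# Packets arriving out of order, with one duplicate caused by a retransmission.
arrived = [(6, b"HTTP/1.1"), (0, b"GET / "), (14, b"\r\n"), (6, b"HTTP/1.1")]
message = reassemble(arrived)
# message == b"GET / HTTP/1.1\r\n"
```

The application reading from the socket never sees any of this; it only receives the bytes in the right order.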

UDP skips the three-way handshake. It simply sends. It is not guaranteed all of your message will get there. It is not guaranteed it will arrive in the order it was sent. It is well suited to cases where the network is reliable enough that most of your messages will probably arrive, but where it doesn't matter if some don't (e.g. it is okay if some frames in a video don't arrive).

About Video

Video is fundamentally no different from any other content format. It is perfectly possible to use HTTP for videos. Whether it is advisable to use TCP is another matter, but isn't bad - Skype uses both UDP and TCP.

All video consists of a series of bytes. How those bytes are to be interpreted is the job of the format. Video can come in many formats: AVI and MP4 come readily to mind. With HTTP, you can specify the format in the Content-Type message header (for example video/mp4).

HTTP enables compression of content, including for video. HTTP also allows you to request that a connection be kept alive, i.e. that a new three-way handshake need not be performed after a full message has been sent. A companion protocol called WebSocket, which starts as an HTTP request and then upgrades the connection, keeps a single connection open in both directions and can be used for real-time data such as video. These only optimise the video arrival so it doesn't look laggy; they don't change how the video arrives.

Of course, sometimes you want more guarantees about video, and there are lots and lots of tricks to use to support high-fidelity video in low-speed Internet environments, or enable multiple people to subscribe to a live broadcast, etc. That's when you have to get creative. But otherwise video content is not fundamentally different from any other content type.

To Answer Your Questions

When I visit a site, my browser asks a server for an HTML file. For that, my browser creates a socket, binds it to my IP address, and connects it to a listening socket of the server of the site I am visiting. In order to connect my browser's socket to the server I need a port number and a hostname; the port number is 80 because this is HTTP, and the hostname is obtained via DNS resolution. Now that there is a connection between sockets, my browser sends a GET request. That request is an ASCII file with the contents corresponding to an HTTP request. My browser writes the raw ASCII bytes to the socket, and they are written to the server's socket.

HTTP does not require port 80. It is a convention that port 80 is the default port for HTTP servers and 443 for HTTPS, but any port can be used, so long as no other program is already listening on it.

You do not receive a hostname from DNS. Actually, it's the opposite - you supply a hostname, and retrieve an IP address from DNS. It is the IP address that is used to identify a location on another network.

It is not necessary for the response to be ASCII. Headers, yes, are to be interpreted as ASCII as they are part of an international standard that was developed before UTF-8 gained prominence, but no such restrictions are needed on the body. In fact, the character encoding of the body is traditionally passed along in a header itself (Content-Type, for example text/html; charset=utf-8), which the browser or a client can use to decode the body content automatically.
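Here is a small Python sketch of that last point: the headers are decoded as ASCII, and the charset named in Content-Type is then used to decode the body. The helper `charset_from_content_type` and the fallback to UTF-8 are assumptions made for the example, not rules from the standard.

```python
def charset_from_content_type(value: str, default: str = "utf-8") -> str:
    """Pull the charset parameter out of a Content-Type value, if present."""
    for param in value.split(";")[1:]:
        name, _, val = param.strip().partition("=")
        if name.lower() == "charset":
            return val.strip('"').lower()
    return default  # assumed fallback for this sketch


# A response whose body contains a non-ASCII character, encoded as UTF-8.
raw = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html; charset=utf-8\r\n"
    b"\r\n" + "<p>caf\u00e9</p>".encode("utf-8")
)

head, _, body = raw.partition(b"\r\n\r\n")

headers = {}
for line in head.decode("ascii").split("\r\n")[1:]:  # skip the status line
    name, _, value = line.partition(":")
    headers[name.strip().lower()] = value.strip()

text = body.decode(charset_from_content_type(headers["content-type"]))
# text == "<p>café</p>"
```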

The server writes back the HTML file I requested to the socket. The HTML the server sends is just an ASCII file that the server will write byte by byte to the socket.

Yes, except there is no need for it to be ASCII.

My browser receives the ASCII file and parses it. Let's assume here that it finds an image tag. The browser sends an HTTP request for that image file. Here comes something I don't understand. How does the server respond? As far as I can tell the server must send back an ASCII file formed by a set of headers followed by a CRLF and then the body of the message. In this case, assuming my browser asked for a .jpeg, does the server write the headers as ASCII plaintext to the socket and then write the raw bytes of the image to the socket?

Yes.

If the HTML file has several images do we open a socket per image (per request)?

See this answer. The HTML is downloaded first, before the image requests are fired off, and images are generally requested in the order that they are encountered in the document. If you have 24 images from the same host on Chrome, at most 6 of them will be loaded in parallel at a time, meaning six parallel connections that work through the images in roughly four rounds.

You can additionally answer this yourself by opening up the Network tab in Chrome's developer tools, and inspecting whether requests for images are fired off in parallel.
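The "at most 6 at a time" behaviour can be mimicked with a short Python sketch. It does no real networking: `asyncio.sleep` stands in for the round trip, and the limit of 6 is taken from the browser behaviour described above. The function name `fetch_all` is made up for the example.

```python
import asyncio


async def fetch_all(n_requests: int, max_connections: int = 6) -> int:
    """Run n_requests fake downloads, never more than max_connections at once.

    Returns the highest number of downloads that were in flight together.
    """
    limit = asyncio.Semaphore(max_connections)
    in_flight = 0
    peak = 0

    async def fetch(i: int) -> None:
        nonlocal in_flight, peak
        async with limit:              # wait here until a connection is free
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stand-in for the network round trip
            in_flight -= 1

    await asyncio.gather(*(fetch(i) for i in range(n_requests)))
    return peak


peak = asyncio.run(fetch_all(24))
# peak == 6: the other 18 requests queue up behind the first 6
```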

Let's assume that my browser now finds a JavaScript tag. When the server answers my request for that script, does the server write the ASCII bytes of the source of the script to the socket? What happens with JS libraries? Does the server have to send all the source code for each one?

The HTML specification gives you some control over when your JavaScript files are fetched and executed, through the async and defer attributes of the script tag.

Yes, the server writes bytes. The bytes do not need to be ASCII-encoded. The headers will be in ASCII. Yes, the server must send the source code for each library. This is why an important part of web optimisation is minimising your Javascript file sizes and bundling all the libraries into one file, in order to reduce the number and size of requests.

On writing data to the sockets: is write(2) the correct way to do all this writing between sockets?

It is certainly the most basic way to write to an open file descriptor on Linux kernels. Everything in Linux is treated like a file, including sockets, so yes, sockets have file descriptors and can be written to this way.

There are more specialised ways of accomplishing this, such as send(2) and writev(2); the manual page for write references some of them. Most languages have support for writing to sockets, however, with glue code that calls write() or send() for you behind a friendlier interface. Perhaps the only time you would need to explicitly call write() yourself is if you were writing low-level programs in C or are on embedded hardware.
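As a small demonstration that a socket really is just a file descriptor, the Python sketch below calls `os.write`, a thin wrapper around write(2), directly on a socket's descriptor. It assumes a Unix-like system and uses a local `socket.socketpair()` rather than a real server; the Host value is only a placeholder.

```python
import os
import socket

left, right = socket.socketpair()  # two connected sockets on the same machine

request = b"GET / HTTP/1.1\r\nHost: example.com\r\n\r\n"
view = memoryview(request)

# write(2) may accept fewer bytes than asked, so loop until everything is sent.
sent = 0
while sent < len(request):
    sent += os.write(left.fileno(), view[sent:])

received = right.recv(4096)

left.close()
right.close()
```

In everyday code you would call `left.sendall(request)` instead, which performs the same loop for you.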

On the transmission of large files: if I click a button on the site that lets me download a large PDF, how is this accomplished by the server? I assume that the server tries to transmit this in pieces. As far as I can tell there is an option for chunked encoding. Is this the way? If it is, is the file divided into chunks, and these are appended to the ASCII response and written byte by byte into the socket?

See the TCP/IP section I wrote above. The HTTP standard does let you break up a message into higher-order chunks (chunked transfer encoding) before letting TCP split it still further, so the server can send a body piece by piece without knowing its total size up front.

Finally, how is video transmitted?

See the video section I wrote above.

Akshat Mahajan

HTTP, sockets, streaming and packet transmission are different topics.

HTTP is a communication protocol to request or send data. Web developers rarely use sockets directly, since browsers and HTTP libraries open and manage the underlying connections for them. How your browser manages the HTTP requests usually should not be a real concern for you.

For big chunks of data like video, streaming is maybe the best technique, because you don't need synchronization between the client and server, or an always active connection like with sockets. The way streaming is done only depends on you and on the language you have on the server to share your content.

If you want to learn more about HTTP, I recommend reading a little of RFCs like RFC 7230 or RFC 7231. To understand how data is transmitted you should really know the basics of abstraction layers, and for video streaming, you might learn how to make a video streaming server with Node.js (or another language of your preference), or just search for and install an npm package that already does that job for you.

Canilho