It is highly recommended to read High-Performance Browser Networking.
About HTTP
HTTP is a message structuring protocol. It can be built on top of TCP/IP, or UDP, or any other communication protocol.
IP solves the problem of figuring out which computer in a network a message is meant to get to, and TCP solves the problem of ensuring the message gets received despite noise interfering. UDP does a similar job to TCP, but drops some important guarantees, which makes it faster and therefore better in some situations, such as video streaming.
HTTP only solves the problem of what the messages should look like so everyone can understand what you mean. An HTTP message consists of a header and a body. The body is the message you want to send; the header contains meta-information about the status of the message itself. HTTP lets you structure your applications in a meaningful, context-oriented way through a standard set of terms.
For example, you can communicate character encodings of your body with HTTP, how long your content is, whether you are okay with receiving it in a compressed format, and so on and so forth. So, no, HTTP is not limited to ASCII texts - you can send UTF-8 encoded characters with BOM markings, or not even specify an encoding at all. All HTTP does is let you ask for things in the way you want it, and inform recipients how you've packaged a message.
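As a rough illustration of headers describing the body, here is a minimal sketch that builds an HTTP/1.1 response by hand. The helper name `build_response` is hypothetical, and only two headers are shown; a real server would send more.

```python
# Hypothetical helper (not from any library): build a response whose
# headers describe how the body is packaged.
def build_response(text: str, charset: str = "utf-8") -> bytes:
    body = text.encode(charset)  # the body can be any bytes, not just ASCII
    header_lines = [
        "HTTP/1.1 200 OK",
        f"Content-Type: text/plain; charset={charset}",
        f"Content-Length: {len(body)}",  # length in bytes, not characters
    ]
    # Headers are ASCII text; a blank line separates them from the body.
    head = "\r\n".join(header_lines) + "\r\n\r\n"
    return head.encode("ascii") + body


message = build_response("héllo")
print(message)
```

The accented character takes two bytes in UTF-8, so the advertised length is 6 even though the text has 5 characters.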
The things actually responsible for how your messages are sent, rather than structured, are TCP/IP and UDP. HTTP has nothing to do with it. Both TCP/IP and UDP add overhead, but it is well worth it so that communication can pass through unimpeded.
About Sockets
Computers listen on "sockets", which is just a fancy name for an endpoint of a communication channel. It does not matter what the channel physically is, be it a wire or a wireless radio. All that matters is what a socket can do. Computers can write bytes to a socket and can read bytes sent through a socket. Sockets always carry a certain amount of memory reserved for incoming and outgoing messages (like an inbox and an outbox) called a buffer, and can even bundle many small messages together and send them in one shot to save time. Forcing buffered bytes to be sent immediately is called flushing.
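To make the "send bytes, read bytes, buffer in between" idea concrete, here is a small sketch. It assumes a Unix-like system and uses `socket.socketpair()` to get two connected sockets inside one process, standing in for two computers.

```python
import socket

# Two already-connected sockets in one process (a stand-in for two machines).
left, right = socket.socketpair()

# Each write is queued in a kernel buffer until the other side reads it.
left.sendall(b"hello ")
left.sendall(b"world")

# A single recv() may return one write, both writes, or part of one,
# so keep reading until the expected number of bytes has arrived.
received = b""
while len(received) < 11:
    received += right.recv(1024)

left.close()
right.close()
print(received)
```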
Sockets at the hardware level usually devolve to a network card, which lets you talk to a wireless network or to an Ethernet cable. Note that the computer may have many more sockets than cables - this is because a socket is a generic name for a single communication channel, and a single network card can handle multiple communication channels. Being able to handle multiple channels at once is called multiplexing.
TCP/IP and UDP are just blueprints - it is the responsibility of the operating system to actually do as they lay out, and most OSs have some program designed to implement these standards. At the software level, how information is read and written becomes slightly more complicated than just passing bytes since a computer must also be able to interrupt its running programs when a hardware event happens, including while communicating from a socket - here is a reference for how the Linux kernel implements TCP/IP.
All operating systems expose a set of calls to attach a socket to an address (bind), start listening on it (listen), read from it and write to it. You can wait for data on a socket in multiple ways, however. These range from the basic select() and poll() on most Unix-like systems, which make the program wait until at least one of the sockets it asked about is ready and then scan the whole list to find out which, to epoll() on Linux, which lets a program register its sockets once and be told only about the ones that have become ready, so it copes better with many connections.
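A minimal sketch of waiting for a socket to become readable, using Python's standard `selectors` module, which wraps whichever of these system calls the platform provides. The socket pair is an assumption made to keep the example self-contained on a Unix-like system.

```python
import selectors
import socket

# DefaultSelector picks the most efficient mechanism available on this platform.
sel = selectors.DefaultSelector()
reader, writer = socket.socketpair()
reader.setblocking(False)
sel.register(reader, selectors.EVENT_READ)

# Nothing has been written yet, so no socket is reported as ready.
before = sel.select(timeout=0)

writer.sendall(b"ping")

# Now the selector reports the reader as ready, and recv() will not block.
data = b""
for key, _events in sel.select(timeout=1):
    data = key.fileobj.recv(1024)

sel.unregister(reader)
sel.close()
reader.close()
writer.close()
print(before, data)
```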
Windows exposes a completely different set of system calls, so you would be well advised to consult its reference manual if you plan to build applications for Windows.
About TCP/IP
TCP/IP is a combination of two protocols that has mostly become the norm for ensuring reliable communication.
IP is responsible for the term IP address. Every computer has a unique address associated with it, specified as either a 32-bit number (IPv4) or a 128-bit number (IPv6, or IP version 6). Note that these addresses do not exist outside of a network: a network is just a collection of computers, and a computer's address only makes sense within that collection. The network that the computer comes from is part of the IP address of a computer; the network itself is given a unique address; and a network may be composed of multiple networks. Ports are introduced one layer up, by TCP and UDP: a port is a 16-bit number that identifies which program on a computer a message is meant for, and a socket is identified by the combination of an IP address and a port.
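To see that an address really is just a number with the network baked into it, here is a short sketch using Python's standard `ipaddress` module. The addresses are documentation-only examples chosen for illustration.

```python
import ipaddress

v4 = ipaddress.ip_address("192.0.2.10")   # example IPv4 address
v6 = ipaddress.ip_address("2001:db8::1")  # example IPv6 address

# The dotted form is only a human-friendly spelling of one integer.
v4_number = int(v4)
v4_bits = v4.max_prefixlen  # 32 for IPv4
v6_bits = v6.max_prefixlen  # 128 for IPv6

# The leading bits of an address name the network it belongs to.
network = ipaddress.ip_network("192.0.2.0/24")
belongs = v4 in network

print(v4_number, v4_bits, v6_bits, belongs)
```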
I'm just tossing about the term 'network' willy-nilly as an abstract concept, but physically it boils down to a router. A router is a special computer responsible for figuring out who is being referenced to in a message using the IP address attached to the message, for assigning IP addresses to computers it is aware of (a network is quite literally the set of computers the router knows about), and for forwarding messages to other computers or routers. An internetwork (or just the Internet) is simply a bunch of routers, each with their own network, able to communicate to each other to form one giant network of connected networks. Effectively, a router implements the IP standard.
TCP and UDP sit on top of IP and deal with another harrowing problem: how to get your messages through. Sending any message down a shared communication channel like wireless or even wired channels organised like a bus topology is inherently messy - different messages can overlap, messages can be lost unexpectedly, messages can be corrupted and so on. TCP aims to solve these problems by guaranteeing all of a message goes through. On the other hand, UDP makes no such guarantees, and thus saves time by skipping a lot of steps TCP does.
TCP chunks the message into packets of a certain size, so that a message can be sent out as quickly as possible; UDP sends each message as its own packet. TCP further adds some additional structure to the exchange called a three-way handshake:
- It sends off a TCP-specific message called a SYN packet to the computer it wants to send a message to, and waits for a response.
- If the target computer receives it, it responds with a SYN ACK packet. On receiving this, the source computer responds with an ACK packet. This lets both computers know each other is listening, and they can start sending packets.
- On the other hand, if either the source or target computer doesn't hear anything after a while, it sends again and waits some more. Every time it has to wait, it waits twice as long as it did last time, until a maximum wait period has been reached and it aborts the connection. This is called exponential backoff, and is key to TCP.
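The doubling wait described in the last step can be sketched as below. The function name and the numbers (1 second initial wait, 60 second cap, 8 attempts) are illustrative assumptions, not the timers a real TCP stack uses.

```python
def backoff_schedule(initial: float, maximum: float, attempts: int) -> list[float]:
    """Return the wait before each retry, doubling each time up to a cap."""
    waits = []
    wait = initial
    for _ in range(attempts):
        waits.append(wait)
        wait = min(wait * 2, maximum)  # double, but never exceed the cap
    return waits


schedule = backoff_schedule(initial=1.0, maximum=60.0, attempts=8)
print(schedule)  # [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 60.0]
```

A sender following this schedule would give up once the attempts run out.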
A three-way handshake ensures everyone is ready and willing to listen. However, the fun doesn't stop there:
- As part of the handshake, each computer tells the other the largest packet it will accept and how much data it is willing to buffer at once (its receive window).
- After the handshake, the source computer sends as many packets as the window allows and waits for ACKs. A single ACK can cover several packets. If a packet goes unacknowledged for too long, the source resends it, doubling its timeout each time it has to do so.
- Meanwhile, packets may arrive at the target out of order, depending on the path the intervening routers chose for each one. Each packet therefore carries a sequence number indicating its position, and the target computer uses these numbers to put the packets back together into one neat message.
- Once the source receives an ACK, it uses the total time taken to see how much it can send next. The better the response time, the more packets TCP is willing to send.
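The reordering step can be sketched in a few lines. This is a toy model under stated assumptions: the function names are hypothetical, each piece is tagged with its byte offset as a stand-in for a TCP sequence number, and nothing is lost or duplicated.

```python
import random


def split_message(message: bytes, size: int) -> list[tuple[int, bytes]]:
    """Cut a message into (offset, chunk) pairs of at most `size` bytes."""
    return [(i, message[i:i + size]) for i in range(0, len(message), size)]


def reassemble(segments: list[tuple[int, bytes]]) -> bytes:
    """Sort the pieces by offset and join them back together."""
    return b"".join(chunk for _offset, chunk in sorted(segments))


original = b"The quick brown fox jumps over the lazy dog"
segments = split_message(original, 8)
random.shuffle(segments)  # simulate packets taking different routes
restored = reassemble(segments)
print(restored)
```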
UDP skips the three-way handshake. It simply wraps each message in a packet and sends it. It is not guaranteed that all of your message will get there, nor that the pieces will arrive in the order they were sent. It is perfect for cases where the network is reliable enough that most of your messages will probably arrive, and where it doesn't matter if some don't (e.g. it is okay if some frames in a video don't arrive).
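For comparison, a minimal sketch of sending one UDP message over the local loopback interface. It assumes 127.0.0.1 is available; the payload and the timeout value are arbitrary choices for illustration.

```python
import socket

receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", 0))  # port 0 asks the OS for any free port
receiver.settimeout(2.0)
address = receiver.getsockname()

sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sender.sendto(b"frame-1", address)  # no connect() and no handshake first

try:
    datagram, _source = receiver.recvfrom(2048)
except socket.timeout:
    datagram = None  # a lost message is simply never seen; nothing retries it
finally:
    sender.close()
    receiver.close()

print(datagram)
```

The sender gets no confirmation either way, which is exactly the trade-off described above.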
About Video
Video is fundamentally no different from any other content format. It is perfectly possible to use HTTP for videos. Whether it is advisable to use TCP is another matter, but it isn't necessarily bad - Skype uses both UDP and TCP.
All video consists of a series of bytes. How those bytes are to be interpreted is the job of the encoding. Video can come in many formats: avi and mp4 come readily to mind. With HTTP, you can specify the format of the content in the Content-Type header of the message (for example, video/mp4).
HTTP enables compression of content, including for video. HTTP also allows you to request that a connection be kept alive, i.e. that a new three-way handshake need not be performed after a full message has been sent. WebSocket, a protocol that starts out as an HTTP request and then upgrades the connection, builds on this idea of a long-lived connection and can be used to pass video in real time. These features only optimise the video's arrival so it doesn't look laggy; they don't change how the video arrives.
Of course, sometimes you want more guarantees about video, and there are lots and lots of tricks to use to support high-fidelity video in low-speed Internet environments, or enable multiple people to subscribe to a live broadcast, etc. That's when you have to get creative. But otherwise video content is not fundamentally different from any other content type.
To Answer Your Questions
When I visit a site my browser asks for an HTML file to a server, for
that my browser creates a socket, binds it to my ip adress, and
connects it to a listening socket of the server of the site I am
visiting. In order to connect my browser's socket to the server I need
a port number and a hostname, the port number is 80 because this is
HTTP and the hostname is obtained via DNS resolution. Now that there
is a connection between sockets my browser sends a GET request. That
request is an ASCII file with the contents corresponding to an HTTP
request. My browser writes the ASCII raw bytes to the socket and that
is written to the server's socket.
HTTP does not require port 80. It is a convention that port 80 be the default port for HTTP-using servers and 443 for HTTPS, but any port can be used, so long as no other program is already using that port.
You do not receive a hostname from DNS. Actually, it's the opposite - you supply a hostname, and retrieve an IP address from DNS. It is the IP address that is used to identify a location on another network.
It is not necessary for the message to be ASCII. Headers, yes, are to be interpreted as ASCII as they are part of an international standard that was developed before UTF-8 gained prominence, but no such restrictions are needed on the body. In fact, the character encoding of the body is traditionally passed along in a header itself (Content-Type), which the browser or a client can use to decode the body content automatically.
The server writes back the HTML file I requested to the socket. The
HTML the server sends is just an ASCII file that the server will write
byte by byte to the socket.
Yes, except there is no need for it to be ASCII.
My browser recieves the ASCII file and parses it. Lets assume here
that it finds an image tag. The browser sends an HTTP request for that
image file. Here comes something I don't understand. How does the
server respond? As far as I can tell the server must send back an
ASCII file formed by a set of headers followed by a CRLF and then the
body of the message. In this case, assuming my browser asked for a
.jpeg, does the server write the headers as ASCII plaintext to the
socket and then writes the raw bytes of the image to the socket?
Yes.
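A minimal sketch of what that looks like on the wire, and how a client can split it apart again. The six "image" bytes are a made-up stand-in for a real JPEG, and the parsing is deliberately naive.

```python
# Made-up payload standing in for a JPEG; note the bytes are not valid ASCII.
image = bytes([0xFF, 0xD8, 0xFF, 0xE0, 0x00, 0x10])

head = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: image/jpeg\r\n"
    f"Content-Length: {len(image)}\r\n"
    "\r\n"  # blank line: end of headers
).encode("ascii")
response = head + image  # ASCII headers followed directly by raw bytes

# Client side: everything before the first blank line is header text.
raw_head, _sep, rest = response.partition(b"\r\n\r\n")
headers = {}
for line in raw_head.decode("ascii").split("\r\n")[1:]:  # skip the status line
    name, _colon, value = line.partition(":")
    headers[name.strip().lower()] = value.strip()

# Content-Length says how many of the following bytes belong to the body.
body = rest[: int(headers["content-length"])]
print(headers, body)
```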
If the HTML file has several images do we open a socket per image (per
request)?
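See this answer. The browser must receive the HTML before it can discover the images, and image requests are generally fired off in the order the images are encountered in the document. A browser does not open one socket per image; it opens a limited number of connections to a server and reuses them. If you have 24 images on Chrome over HTTP/1.1, 6 of them will be loaded in parallel at a time, meaning six parallel connections working through four batches.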
You can additionally answer this yourself by opening up your Network tab in the Chrome console, and inspecting whether requests for images are fired off in parallel.
Lets assume that my browser now finds a javascript tag. When the
server answers to my request for that script does the server writes
the ASCII bytes of the source of the script to the socket? What
happens with js libraries? Does the server have to send all the source
code for each one?
The HTML specification gives you some control over when your Javascript files are downloaded and executed, for example through the async and defer attributes of the script tag.
Yes, the server writes bytes. The bytes do not need to be ASCII-encoded. The headers will be in ASCII. Yes, the server must send the source code for each library. This is why an important part of web optimisation is minimising your Javascript file sizes and bundling all the libraries into one file, in order to reduce the number and size of requests.
On writing data to the sockets: is write(2) the correct way to do all
this writing between sockets?
It is certainly the most basic way to write to an open file descriptor on Linux kernels. Everything in Linux is treated like a file, including sockets, so yes, sockets have file descriptors and can be written to this way.
There are more complex ways of accomplishing this, all of which are referenced in the manual page for write. Most languages have support for writing to sockets, however, by having glue code that calls write() for you behind a friendlier interface. Perhaps the only time you would need to explicitly call write() in C is if you were writing low-level systems programs or are on embedded hardware.
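To illustrate that a socket really is just a file descriptor, here is a sketch using Python's `os.write` and `os.read`, which are thin wrappers over write(2) and read(2). It assumes a Unix-like system; on Windows, socket handles cannot be used this way.

```python
import os
import socket

a, b = socket.socketpair()

request = b"GET / HTTP/1.1\r\n\r\n"
# Write straight to the socket's file descriptor, as write(2) would in C.
written = os.write(a.fileno(), request)

# The other end reads the same bytes back from its own descriptor.
data = os.read(b.fileno(), 1024)

a.close()
b.close()
print(written, data)
```

In everyday code you would call `sendall()` on the socket object instead, which also handles the case where only part of the data was written.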
On the transmission of large files: if I click a button on the site
that lets me download a large PDF, how is this accomplished by the
server? I assume that the server tries to transmit this in pieces. As
far as I can tell there is an option for chunked encoding. Is this the
way? If it is, is the file divided into chunks, and these are appended
to the ASCII response and written byte by byte into the socket?
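See the TCP/IP section I wrote above. Yes, chunked transfer encoding is one way: HTTP lets the server break the body into chunks, each prefixed with its size, before TCP splits it still further into packets, so the client can make do with small pieces that arrive one at a time. The headers are sent once at the start, and the chunks follow as the body.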
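A simplified sketch of the chunked body format: each chunk is preceded by its size in hexadecimal, and a zero-sized chunk marks the end. The function names are hypothetical, and optional parts of the format (chunk extensions and trailers) are ignored.

```python
def encode_chunked(body: bytes, chunk_size: int) -> bytes:
    """Encode a body as size-prefixed chunks followed by a terminating chunk."""
    out = bytearray()
    for i in range(0, len(body), chunk_size):
        chunk = body[i:i + chunk_size]
        out += b"%x\r\n" % len(chunk) + chunk + b"\r\n"  # hex size, data
    out += b"0\r\n\r\n"  # zero-length chunk signals the end of the body
    return bytes(out)


def decode_chunked(data: bytes) -> bytes:
    """Reverse encode_chunked: read sizes and collect the chunk data."""
    body = bytearray()
    pos = 0
    while True:
        line_end = data.index(b"\r\n", pos)
        size = int(data[pos:line_end], 16)
        if size == 0:
            return bytes(body)
        start = line_end + 2
        body += data[start:start + size]
        pos = start + size + 2  # skip the CRLF that follows the chunk data


encoded = encode_chunked(b"hello world", 4)
decoded = decode_chunked(encoded)
print(encoded, decoded)
```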
Finally, how is video transmitted?
See the video section I wrote above.